'
type: GLMProcessor
# ==== dataset config ====
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids", "label", "position_ids", "attention_mask"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 1
  repeat: 1
  numa_enable: False
  prefetch_size: 1
  seed: 0
train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset
eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids", "label"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 1
  repeat: 1
  numa_enable: False
  prefetch_size: 1
  seed: 0
eval_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *eval_dataset
# ==== runner config ====
runner_config:
  epochs: 1
  batch_size: 1
  sink_mode: True
  sink_size: 4
runner_wrapper:
  type: MFTrainOneStepCell
  scale_sense:
    type: DynamicLossScaleUpdateCell
    loss_scale_value: 4294967296
    scale_factor: 2
    scale_window: 1000
  use_clip_grad: True
# lr schedule
lr_schedule:
  type: polynomial
  learning_rate: 5.e-5
  lr_end: 1.e-6
  warmup_steps: 2000
  total_steps: -1 # -1 means it will load the total steps of the dataset
# optimizer
optimizer:
  type: FusedAdamWeightDecay
  beta1: 0.9
  beta2: 0.95
  eps: 1.e-8
  weight_decay: 0.1
# parallel config
use_parallel: False
parallel:
  parallel_mode: 0 # 0-dataset, 1-semi, 2-auto, 3-hybrid
  gradients_mean: False
  loss_repeated_mean: True
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: False # optimizer shard
  strategy_ckpt_save_file: "./ckpt_strategy.ckpt"
parallel_config:
  data_parallel: 1
  model_parallel: 1
  pipeline_stage: 1
  expert_parallel: 1
  optimizer_shard: False # optimizer shard
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
micro_batch_interleave_num: 1
# moe
moe_config:
  expert_num: 1
  capacity_factor: 1.05
  aux_loss_factor: 0.05
  num_experts_chosen: 1
# recompute
recompute_config:
  recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: False
# autotune
auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10
# profile
profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: True
profile_communication: True
profile_memory: True
# callbacks
callbacks:
  - type: MFLossMonitor
  - type: SummaryMonitor
    keep_default_action: True
  - type: CheckpointMointor
    prefix: "glm-6b-lora"
    save_checkpoint_steps: 500
    keep_checkpoint_max: 2
    integrated_save: False
    async_save: False
  - type: ObsMonitor
    keep_last: False
eval_callbacks:
  - type: ObsMonitor
    keep_last: False
================================================
FILE: llm-localization/ascend/mindformers/env.md
================================================
```
docker pull --platform=arm64 swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.0_mindspore2.2.11:aarch_20240125
```
```
docker run -it -u root \
--ipc=host \
--network=host \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /var/log/npu/:/usr/slog \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
--name mindformers_dev \
swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.0_mindspore2.2.11:aarch_20240125 \
/bin/bash
```
```
git clone -b dev https://gitee.com/mindspore/mindformers.git
cd mindformers
bash build.sh
```
================================================
FILE: llm-localization/ascend/mindformers/llama/README.md
================================================
## Training
================================================
FILE: llm-localization/ascend/mindformers/qwen/qwen1训练.md
================================================
- https://gitee.com/mindspore/mindformers/blob/r1.0/research/qwen/qwen.md
================================================
FILE: llm-localization/ascend/mindformers/qwen/run_qwen_7b.yaml
================================================
seed: 0
output_dir: './output' # path to save checkpoint/strategy
load_checkpoint: ''
src_strategy_path_or_dir: ''
auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: False
run_mode: 'predict'
# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'qwen_7b'
# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids", "labels", "attention_mask"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 4
  repeat: 1
  numa_enable: False
  prefetch_size: 1
train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset
# runner config
runner_config:
  epochs: 5
  batch_size: 1
  sink_mode: True
  sink_size: 2
runner_wrapper:
  type: MFTrainOneStepCell
  scale_sense:
    type: DynamicLossScaleUpdateCell
    loss_scale_value: 65536
    scale_factor: 2
    scale_window: 1000
  use_clip_grad: True
# optimizer
optimizer:
  type: FP32StateAdamWeightDecay
  beta1: 0.9
  beta2: 0.95
  eps: 1.e-6
  weight_decay: 0.1
# lr schedule
lr_schedule:
  type: CosineWithWarmUpLR
  learning_rate: 1.e-5
  warmup_ratio: 0.01
  total_steps: -1 # -1 means it will load the total steps of the dataset
# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "qwen"
    save_checkpoint_steps: 10000
    keep_checkpoint_max: 3
    integrated_save: False
    async_save: False
  - type: ObsMonitor
# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
  data_parallel: 8
  model_parallel: 1
  pipeline_stage: 1
  micro_batch_num: 1
  vocab_emb_dp: False
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1
# recompute config
recompute_config:
  recompute: False
  select_recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: False
  recompute_slice_activation: False
model:
  model_config:
    type: QwenConfig
    batch_size: 1
    seq_length: 1024
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 151936
    intermediate_size: 11008
    rms_norm_eps: 1.0e-6
    emb_dropout_prob: 0.0
    eos_token_id: 151643
    pad_token_id: 151643
    compute_dtype: "float16"
    layernorm_compute_type: "float32"
    softmax_compute_type: "float16"
    rotary_dtype: "float16"
    param_init_type: "float32"
    use_past: True
    use_flash_attention: False
    use_paged_attention: False # only supported in mslite inference
    block_size: 32
    num_blocks: 128
    is_dynamic: False
    use_kvcache_op: False
    offset: 0
    checkpoint_name_or_path: "/path/qwen_7b_base.ckpt"
    repetition_penalty: 1
    max_decode_length: 512
    top_k: 0
    top_p: 0.8
    do_sample: False
    # configuration items copied from Qwen
    rotary_pct: 1.0
    rotary_emb_base: 10000
    kv_channels: 128
  arch:
    type: QwenForCausalLM
processor:
  return_tensors: ms
  tokenizer:
    model_max_length: 8192
    vocab_file: "/path/qwen.tiktoken"
    pad_token: "<|endoftext|>"
    type: QwenTokenizer
  type: QwenProcessor
# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  ascend_config:
    precision_mode: "must_keep_origin_dtype"
  max_call_depth: 10000
  max_device_memory: "58GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0
# parallel context config
parallel:
  parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: True
  strategy_ckpt_config:
    save_file: "./ckpt_strategy.ckpt"
    only_trainable_params: False
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64
# aicc
remote_save_url: "Please input obs url on AICC platform."
================================================
FILE: llm-localization/ascend/mindformers/qwen/run_qwen_7b_910b.yaml
================================================
seed: 0
output_dir: './output' # path to save checkpoint/strategy
load_checkpoint: ''
src_strategy_path_or_dir: ''
auto_trans_ckpt: True # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: True
run_mode: 'finetune'
# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'qwen_7b'
# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids", "labels", "attention_mask"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 4
  repeat: 1
  numa_enable: False
  prefetch_size: 1
train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset
# runner config
runner_config:
  epochs: 5
  batch_size: 1
  sink_mode: True
  sink_size: 2
runner_wrapper:
  type: MFTrainOneStepCell
  scale_sense:
    type: DynamicLossScaleUpdateCell
    loss_scale_value: 65536
    scale_factor: 2
    scale_window: 1000
  use_clip_grad: True
# optimizer
optimizer:
  type: FP32StateAdamWeightDecay
  beta1: 0.9
  beta2: 0.95
  eps: 1.e-6
  weight_decay: 0.1
# lr schedule
lr_schedule:
  type: CosineWithWarmUpLR
  learning_rate: 1.e-5
  warmup_ratio: 0.01
  total_steps: -1 # -1 means it will load the total steps of the dataset
# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "qwen"
    save_checkpoint_steps: 10000
    keep_checkpoint_max: 3
    integrated_save: False
    async_save: False
  - type: ObsMonitor
# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
  data_parallel: 4
  model_parallel: 2
  pipeline_stage: 1
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1
# recompute config
recompute_config:
  recompute: False
  select_recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: False
  recompute_slice_activation: False
model:
  model_config:
    type: QwenConfig
    batch_size: 1
    seq_length: 512
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 151936
    intermediate_size: 11008
    rms_norm_eps: 1.0e-6
    emb_dropout_prob: 0.0
    eos_token_id: 151643
    pad_token_id: 151643
    compute_dtype: "float16"
    layernorm_compute_type: "float32"
    softmax_compute_type: "float16"
    rotary_dtype: "float16"
    param_init_type: "float16"
    use_past: True
    use_flash_attention: False
    use_paged_attention: False # only supported in mslite inference
    block_size: 32
    num_blocks: 128
    is_dynamic: False
    use_kvcache_op: False
    offset: 0
    checkpoint_name_or_path: "/path/qwen_7b_base.ckpt"
    repetition_penalty: 1
    max_decode_length: 512
    top_k: 0
    top_p: 0.8
    do_sample: False
    # configuration items copied from Qwen
    rotary_pct: 1.0
    rotary_emb_base: 10000
    kv_channels: 128
  arch:
    type: QwenForCausalLM
processor:
  return_tensors: ms
  tokenizer:
    model_max_length: 8192
    vocab_file: "/path/qwen.tiktoken"
    pad_token: "<|endoftext|>"
    type: QwenTokenizer
  type: QwenProcessor
# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  ascend_config:
    precision_mode: "must_keep_origin_dtype"
  max_call_depth: 10000
  max_device_memory: "30GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0
# parallel context config
parallel:
  parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: True
  strategy_ckpt_config:
    save_file: "./ckpt_strategy.ckpt"
    only_trainable_params: False
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64
# aicc
remote_save_url: "Please input obs url on AICC platform."
================================================
FILE: llm-localization/ascend/mindformers/qwen1.5/qwen1.5训练.md
================================================
- https://gitee.com/mindspore/mindformers/blob/r1.0/research/qwen1_5/qwen1_5.md
- `docker pull swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.1_mindspore2.3rc2:20240511`
================================================
FILE: llm-localization/ascend/mindformers/qwen1.5/run_qwen1_5_7b_finetune.yaml
================================================
seed: 0
output_dir: './output' # path to save checkpoint/strategy
load_checkpoint: ''
src_strategy_path_or_dir: ''
auto_trans_ckpt: True # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
run_mode: 'finetune'
# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'qwen2_7b'
# runner config
runner_config:
  epochs: 5
  batch_size: 1
  sink_mode: True
  sink_size: 2
runner_wrapper:
  type: MFTrainOneStepCell
  scale_sense:
    type: DynamicLossScaleUpdateCell
    loss_scale_value: 4096
    scale_factor: 2
    scale_window: 1000
  use_clip_grad: True
# optimizer
optimizer:
  type: AdamWeightDecayX
  beta1: 0.9
  beta2: 0.95
  eps: 1.e-8
  learning_rate: 1.e-6
  weight_decay: 0.01
# lr schedule
lr_schedule:
  type: CosineWithWarmUpLR
  learning_rate: 1.e-6
  warmup_ratio: 0.01
  total_steps: -1 # -1 means it will load the total steps of the dataset
# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "qwen2"
    save_checkpoint_steps: 1000
    keep_checkpoint_max: 1
    integrated_save: False
    async_save: False
  - type: ObsMonitor
# recompute config
recompute_config:
  recompute: True
  select_recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True
model:
  model_config:
    type: LlamaConfig
    batch_size: 1
    seq_length: 512
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 151936
    intermediate_size: 11008
    qkv_has_bias: True
    rms_norm_eps: 1.0e-5
    theta: 1000000.0
    emb_dropout_prob: 0.0
    eos_token_id: 151643
    pad_token_id: 151643
    compute_dtype: "float16"
    layernorm_compute_type: "float32"
    softmax_compute_type: "float16"
    rotary_dtype: "float16"
    param_init_type: "float16"
    use_past: True
    use_flash_attention: False
    use_past_shard: False
    offset: 0
    checkpoint_name_or_path: ""
    repetition_penalty: 1
    max_decode_length: 512
    top_k: 0
    top_p: 0.8
    do_sample: False
    compute_in_2d: True
    # configuration items copied from Qwen
    rotary_pct: 1.0
    rotary_emb_base: 1000000
    kv_channels: 128
  arch:
    type: LlamaForCausalLM
processor:
  return_tensors: ms
  tokenizer:
    model_max_length: 32768
    vocab_file: "/{path}/vocab.json"
    merges_file: "/{path}/merges.txt"
    unk_token: "<|endoftext|>"
    eos_token: "<|endoftext|>"
    pad_token: "<|endoftext|>"
    type: Qwen2Tokenizer
  type: Qwen2Processor
# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  ascend_config:
    precision_mode: "must_keep_origin_dtype"
  max_call_depth: 10000
  max_device_memory: "30GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0
use_parallel: True
# parallel context config
parallel:
  parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: True
  strategy_ckpt_save_file: "./ckpt_strategy.ckpt"
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64
# default parallel of device num = 32 910B
parallel_config:
  data_parallel: 4
  model_parallel: 2
  pipeline_stage: 1
  use_seq_parallel: True
  micro_batch_num: 8
  vocab_emb_dp: False
  gradient_aggregation_group: 8
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1
# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids", "labels", "attention_mask"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 4
  repeat: 1
  numa_enable: False
  prefetch_size: 1
train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset
do_eval: False
eval_step_interval: -1 # num of step intervals between each eval, -1 means no step end eval.
eval_epoch_interval: 50 # num of epoch intervals between each eval, 1 means eval on every epoch end.
# eval dataset
eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: False
  input_columns: ["input_ids", "target_ids", "attention_mask"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: False
  repeat: 1
  numa_enable: False
  prefetch_size: 1
eval_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *eval_dataset
auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10
profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True
layer_scale: False
layer_decay: 0.65
lr_scale_factor: 256
# aicc
remote_save_url: "Please input obs url on AICC platform."
================================================
FILE: llm-localization/ascend/mindformers/qwen1.5/run_qwen1_5_7b_infer.yaml
================================================
seed: 0
output_dir: './output' # path to save checkpoint/strategy
load_checkpoint: ''
src_strategy_path_or_dir: ''
auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: True
run_mode: 'predict'
# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'qwen2_7b'
# runner config
runner_config:
  epochs: 5
  batch_size: 1
  sink_mode: True
  sink_size: 2
runner_wrapper:
  type: MFTrainOneStepCell
  scale_sense:
    type: DynamicLossScaleUpdateCell
    loss_scale_value: 4294967296
    scale_factor: 2
    scale_window: 1000
  use_clip_grad: True
# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "qwen"
    save_checkpoint_steps: 10000
    keep_checkpoint_max: 1
    integrated_save: False
    async_save: False
  - type: ObsMonitor
# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
  data_parallel: 1
  model_parallel: 4
  pipeline_stage: 1
  micro_batch_num: 1
  vocab_emb_dp: False
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1
# recompute config
recompute_config:
  recompute: True
  select_recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True
model:
  model_config:
    type: LlamaConfig
    batch_size: 1
    seq_length: 8192
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 151936
    intermediate_size: 11008
    qkv_has_bias: True
    rms_norm_eps: 1.0e-5
    theta: 1000000.0
    emb_dropout_prob: 0.0
    eos_token_id: 151643
    pad_token_id: 151643
    compute_dtype: "float16"
    layernorm_compute_type: "float32"
    softmax_compute_type: "float16"
    rotary_dtype: "float16"
    param_init_type: "float16"
    use_past: True
    use_flash_attention: False
    use_past_shard: False
    offset: 0
    checkpoint_name_or_path: ""
    repetition_penalty: 1
    max_decode_length: 512
    top_k: 0
    top_p: 0.8
    do_sample: False
    compute_in_2d: True
  arch:
    type: LlamaForCausalLM
processor:
  return_tensors: ms
  tokenizer:
    model_max_length: 32768
    vocab_file: "/{path}/vocab.json"
    merges_file: "/{path}/merges.txt"
    unk_token: "<|endoftext|>"
    eos_token: "<|endoftext|>"
    pad_token: "<|endoftext|>"
    type: Qwen2Tokenizer
  type: Qwen2Processor
# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  ascend_config:
    precision_mode: "must_keep_origin_dtype"
  max_call_depth: 10000
  max_device_memory: "58GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0
# parallel context config
parallel:
  parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: True
  strategy_ckpt_config:
    save_file: "./ckpt_strategy.ckpt"
    only_trainable_params: False
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64
================================================
FILE: llm-localization/ascend/mindformers/trick.md
================================================
```
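global_batch_size = batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num
# example: 16 = 2 * 1 * 8 * 1
batch_size: number of samples per data batch
micro_batch_num: number of micro-batches for pipeline parallelism. Used when pipeline_stage > 1, i.e. when pipeline parallelism is enabled; must satisfy micro_batch_num >= pipeline_stage
micro_batch_interleave_num: number of slices batch_size is split into; the switch for multi-copy parallelism. Usually used with model parallelism to reduce the communication overhead introduced by model_parallel; not recommended with pure pipeline parallelism.
# compute throughput (samples/s/p): samples processed per second per device
throughput = self.global_batch_size / self.device_num / (per_step_seconds / 1000)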
```
deepspeed:
global_train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs
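
The batch-size and throughput formulas above can be checked with a few lines of Python. This is a minimal sketch: the function names `global_batch_size` and `throughput` are made up for illustration, and `step_time_ms` assumes the step time is measured in milliseconds, which is what the division by 1000 in the formula suggests.

```python
def global_batch_size(batch_size, data_parallel, micro_batch_num=1,
                      micro_batch_interleave_num=1):
    """Product of the four factors in the formula above."""
    return batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num


def throughput(global_bs, device_num, step_time_ms):
    """Samples per second per device (assumes step_time_ms is in milliseconds)."""
    return global_bs / device_num / (step_time_ms / 1000)


# The worked example from the notes: 2 * 1 * 8 * 1.
gbs = global_batch_size(batch_size=2, data_parallel=1, micro_batch_num=8)
print(gbs)
# Hypothetical run: 8 devices, 2000 ms per step.
print(throughput(gbs, device_num=8, step_time_ms=2000))
```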
================================================
FILE: llm-localization/ascend/mindformers/权重格式转换.md
================================================
## Model weight format conversion
- [Offline weight conversion](https://gitee.com/mindspore/mindformers/blob/r1.0/docs/feature_cards/Transform_Ckpt.md)
- https://gitee.com/mindspore/mindformers/blob/r1.0/mindformers/tools/transform_ckpt.py
- https://gitee.com/mindspore/mindspore/blob/v2.2.14/mindspore/python/mindspore/parallel/checkpoint_transform.py
- https://www.mindspore.cn/tutorials/experts/zh-CN/master/parallel/model_transformation.html
ConvertWeight supports converting weights between the torch and MindSpore formats in both directions.
- https://gitee.com/mindspore/mindformers/blob/dev/docs/feature_cards/Convert_Weight.md
- https://gitee.com/mindspore/mindformers/blob/dev/research/qwen/convert_weight.py
- https://gitee.com/mindspore/mindformers/blob/dev/research/qwen/convert_reversed.py
- https://gitee.com/mindspore/mindformers/blob/dev/mindformers/utils/convert_utils.py
- Unified conversion entry point: https://gitee.com/mindspore/mindformers/blob/dev/convert_weight.py
### Print the PyTorch-format model weights
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/workspace/Qwen-7B-Chat", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/workspace/Qwen-7B-Chat", device_map="cpu", torch_dtype=torch.float16, trust_remote_code=True)
# Print the module tree, then every parameter name.
print(model)
for name, param in model.named_parameters():
    print(name)

# Output of print(model):
QWenLMHeadModel(
  (transformer): QWenModel(
    (wte): Embedding(151936, 4096)
    (drop): Dropout(p=0.0, inplace=False)
    (rotary_emb): RotaryEmbedding()
    (h): ModuleList(
      (0-31): 32 x QWenBlock(
        (ln_1): RMSNorm()
        (attn): QWenAttention(
          (c_attn): Linear(in_features=4096, out_features=12288, bias=True)
          (c_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (attn_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): RMSNorm()
        (mlp): QWenMLP(
          (w1): Linear(in_features=4096, out_features=11008, bias=False)
          (w2): Linear(in_features=4096, out_features=11008, bias=False)
          (c_proj): Linear(in_features=11008, out_features=4096, bias=False)
        )
      )
    )
    (ln_f): RMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)

# Output of the named_parameters() loop:
transformer.wte.weight
transformer.h.0.ln_1.weight
transformer.h.0.attn.c_attn.weight
transformer.h.0.attn.c_attn.bias
transformer.h.0.attn.c_proj.weight
transformer.h.0.ln_2.weight
transformer.h.0.mlp.w1.weight
transformer.h.0.mlp.w2.weight
transformer.h.0.mlp.c_proj.weight
...
transformer.h.31.ln_1.weight
transformer.h.31.attn.c_attn.weight
transformer.h.31.attn.c_attn.bias
transformer.h.31.attn.c_proj.weight
transformer.h.31.ln_2.weight
transformer.h.31.mlp.w1.weight
transformer.h.31.mlp.w2.weight
transformer.h.31.mlp.c_proj.weight
transformer.ln_f.weight
lm_head.weight
```
## Print the MindSpore-format model weights
```
from mindformers.tools.register.config import MindFormerConfig
from mindformers.tools.utils import str2bool
from qwen_config import QwenConfig
from qwen_model import QwenForCausalLM
from qwen_tokenizer import QwenTokenizer

yaml_path = "/workspace/mindformers/research/qwen/run_qwen_7b_infer.yaml"
config = MindFormerConfig(yaml_path)
model_config = QwenConfig.from_pretrained(yaml_path)
model = QwenForCausalLM(model_config)
print(model)

# Output of print(model):
QwenForCausalLM<
  (transformer): QwenModel<
    (wte): LlamaEmbedding<>
    (drop): Dropout
    (layers): CellList<
      (0): QwenDecodeLayer<
        (attention_norm): LlamaRMSNorm<>
        (ffn_norm): LlamaRMSNorm<>
        (attention): LLamaAttention<
          (apply_rotary_emb): LlamaRotaryEmbedding<>
          (wq): Linear<>
          (wk): Linear<>
          (wv): Linear<>
          (wo): Linear<>
          (kvcache_mgr): KVCacheMgr<>
          >
        (feed_forward): QwenFeedForward<
          (silu): LlamaSiLU<>
          (w1): Linear<>
          (w2): Linear<>
          (w3): Linear<>
          >
        >
      ...
      (31): QwenDecodeLayer<
        (attention_norm): LlamaRMSNorm<>
        (ffn_norm): LlamaRMSNorm<>
        (attention): LLamaAttention<
          (apply_rotary_emb): LlamaRotaryEmbedding<>
          (wq): Linear<>
          (wk): Linear<>
          (wv): Linear<>
          (wo): Linear<>
          (kvcache_mgr): KVCacheMgr<>
          >
        (feed_forward): QwenFeedForward<
          (silu): LlamaSiLU<>
          (w1): Linear<>
          (w2): Linear<>
          (w3): Linear<>
          >
        >
      >
    (freqs_mgr): FreqsMgr<>
    (casual_mask): CausalMaskForQwen<>
    (kvcache_preprocess): KVCachePreprocess<>
    (ln_f): LlamaRMSNorm<>
    >
  (lm_head): Linear<>
  (loss): CrossEntropyLoss<
    (_softmax): _Softmax<>
    (_nllloss): _NLLLoss<>
    >
  >
```
```
import mindspore as ms

# load_checkpoint returns a dict that maps parameter names to parameters.
model = ms.load_checkpoint("/workspace/qwen-7b-chat-ms/qwen_7b_ms.ckpt")
for name, value in model.items():
    print(name)

# Output:
transformer.wte.embedding_weight
transformer.layers.0.attention_norm.weight
transformer.layers.0.attention.wq.weight
transformer.layers.0.attention.wk.weight
transformer.layers.0.attention.wv.weight
transformer.layers.0.attention.wq.bias
transformer.layers.0.attention.wk.bias
transformer.layers.0.attention.wv.bias
transformer.layers.0.attention.wo.weight
transformer.layers.0.ffn_norm.weight
transformer.layers.0.feed_forward.w1.weight
transformer.layers.0.feed_forward.w3.weight
transformer.layers.0.feed_forward.w2.weight
...
transformer.layers.31.attention_norm.weight
transformer.layers.31.attention.wq.weight
transformer.layers.31.attention.wk.weight
transformer.layers.31.attention.wv.weight
transformer.layers.31.attention.wq.bias
transformer.layers.31.attention.wk.bias
transformer.layers.31.attention.wv.bias
transformer.layers.31.attention.wo.weight
transformer.layers.31.ffn_norm.weight
transformer.layers.31.feed_forward.w1.weight
transformer.layers.31.feed_forward.w3.weight
transformer.layers.31.feed_forward.w2.weight
transformer.ln_f.weight
lm_head.weight
```
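
Comparing the two listings, the PyTorch checkpoint stores one fused `c_attn` projection per layer (12288 = 3 × 4096 output features), while the MindSpore checkpoint has separate `wq`, `wk` and `wv` tensors, so a converter has to split the fused tensor. The sketch below shows the idea on a toy matrix with plain Python lists. It assumes the rows are laid out as query, key, value in that order, and `split_qkv` is a hypothetical helper for illustration, not a function from `convert_weight.py`.

```python
def split_qkv(fused_rows, hidden_size):
    """Split a fused [3 * hidden_size, in_features] weight into q, k, v row blocks.

    Assumes the rows are ordered query, key, value.
    """
    if len(fused_rows) != 3 * hidden_size:
        raise ValueError("expected 3 * hidden_size rows")
    q = fused_rows[:hidden_size]
    k = fused_rows[hidden_size:2 * hidden_size]
    v = fused_rows[2 * hidden_size:]
    return q, k, v


# Toy example with hidden_size = 2: six rows, two input features each.
fused = [[r, r] for r in range(6)]
q, k, v = split_qkv(fused, hidden_size=2)
print(q, k, v)
```

The same slicing applies to the fused `c_attn.bias`, which would be split into three equal parts.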
## qwen1.5
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/workspace/Qwen1.5-7B-Chat", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/workspace/Qwen1.5-7B-Chat", device_map="cpu", torch_dtype=torch.float16, trust_remote_code=True)
# Print the module tree, then every parameter name.
print(model)
for name, param in model.named_parameters():
    print(name)

# Output of print(model):
Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 4096)
    (layers): ModuleList(
      (0-31): 32 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)

# Output of the named_parameters() loop:
model.embed_tokens.weight
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.q_proj.bias
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.k_proj.bias
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.v_proj.bias
model.layers.0.self_attn.o_proj.weight
model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.up_proj.weight
model.layers.0.mlp.down_proj.weight
model.layers.0.input_layernorm.weight
model.layers.0.post_attention_layernorm.weight
...
model.layers.31.self_attn.q_proj.weight
model.layers.31.self_attn.q_proj.bias
model.layers.31.self_attn.k_proj.weight
model.layers.31.self_attn.k_proj.bias
model.layers.31.self_attn.v_proj.weight
model.layers.31.self_attn.v_proj.bias
model.layers.31.self_attn.o_proj.weight
model.layers.31.mlp.gate_proj.weight
model.layers.31.mlp.up_proj.weight
model.layers.31.mlp.down_proj.weight
model.layers.31.input_layernorm.weight
model.layers.31.post_attention_layernorm.weight
model.norm.weight
lm_head.weight
```
```
import mindspore as ms

# load_checkpoint returns a dict that maps parameter names to parameters.
model = ms.load_checkpoint("/workspace/qwen1.5-7b-chat-ms/qwen_7b_ms.ckpt")
for name, value in model.items():
    print(name)

# Output:
model.tok_embeddings.embedding_weight
model.layers.0.attention.wq.weight
model.layers.0.attention.wq.bias
model.layers.0.attention.wk.weight
model.layers.0.attention.wk.bias
model.layers.0.attention.wv.weight
model.layers.0.attention.wv.bias
model.layers.0.attention.wo.weight
model.layers.0.feed_forward.w1.weight
model.layers.0.feed_forward.w3.weight
model.layers.0.feed_forward.w2.weight
model.layers.0.attention_norm.weight
model.layers.0.ffn_norm.weight
...
model.layers.31.attention.wq.weight
model.layers.31.attention.wq.bias
model.layers.31.attention.wk.weight
model.layers.31.attention.wk.bias
model.layers.31.attention.wv.weight
model.layers.31.attention.wv.bias
model.layers.31.attention.wo.weight
model.layers.31.feed_forward.w1.weight
model.layers.31.feed_forward.w3.weight
model.layers.31.feed_forward.w2.weight
model.layers.31.attention_norm.weight
model.layers.31.ffn_norm.weight
model.norm_out.weight
lm_head.weight
```
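
The two Qwen1.5 listings line up entry by entry, which suggests a simple name mapping between the Hugging Face and MindSpore checkpoints. The sketch below encodes that mapping as string replacements. It is inferred only from the order of the names printed above (for example `gate_proj`, `up_proj`, `down_proj` against `w1`, `w3`, `w2`), so treat it as an assumption; `hf_to_ms_name` is a hypothetical helper and not the conversion script's API.

```python
# Substring replacements read off the two listings above (assumed mapping).
RENAMES = [
    ("model.embed_tokens.weight", "model.tok_embeddings.embedding_weight"),
    ("model.norm.weight", "model.norm_out.weight"),
    (".self_attn.q_proj.", ".attention.wq."),
    (".self_attn.k_proj.", ".attention.wk."),
    (".self_attn.v_proj.", ".attention.wv."),
    (".self_attn.o_proj.", ".attention.wo."),
    (".mlp.gate_proj.", ".feed_forward.w1."),
    (".mlp.up_proj.", ".feed_forward.w3."),
    (".mlp.down_proj.", ".feed_forward.w2."),
    (".input_layernorm.", ".attention_norm."),
    (".post_attention_layernorm.", ".ffn_norm."),
]


def hf_to_ms_name(name):
    """Translate one Hugging Face parameter name to its MindSpore counterpart."""
    for old, new in RENAMES:
        name = name.replace(old, new)
    return name


for hf_name in ["model.embed_tokens.weight",
                "model.layers.0.self_attn.q_proj.bias",
                "model.layers.31.mlp.up_proj.weight",
                "lm_head.weight"]:
    print(hf_name, "->", hf_to_ms_name(hf_name))
```

Names without a matching rule, such as `lm_head.weight`, pass through unchanged.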
================================================
FILE: llm-localization/ascend/mindie/2.0.RC2/qwen.md
================================================
# README
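- Qwen is a large language model released by Alibaba Group. It has strong natural language processing capabilities, can understand and generate text, and can be applied to scenarios such as intelligent customer service, content generation, and question answering, helping enterprises upgrade their intelligent capabilities.
# Feature matrix
- The table below shows the features supported by each version of the Qwen models.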
| Model and parameter count | 800I A2 Tensor Parallelism | 300I DUO Tensor Parallelism | FP16 | BF16 | Flash Attention | Paged Attention | W8A8 quantization | W8A16 quantization | KV cache quantization | Sparse quantization (300I DUO only) | MoE quantization | MindIE Service | TGI | Long sequence | prefix_cache | FA3 quantization | functioncall | Multi LoRA | W4A16 quantization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-57B-A14B | world size 8 | × | × | √ | × | √ | × | × | × | × | × | √ | × | × | × | × | × | × | × |
| Qwen2-7B | world size 1,2,4,8 | world size 2,4,8 | √ | √ | × | √ | √ | × | × | √ | × | √ | × | × | × | × | × | × | × |
| Qwen2-72B | world size 1,2,4,8 | world size 2,4,8 | √ | √ | × | √ | √ | √ | √ | √ | × | √ | × | √ | √ | × | √ | √ | × |
| gte-Qwen2-7B | world size 1,2,4 | × | √ | × | × | √ | × | × | × | × | × | × | × | × | × | × | × | × | × |
| Qwen2.5-0.5B | world size 1,2,4,8 | world size 2,4,8 | √ | √ | × | √ | × | × | × | × | × | × | × | × | × | × | × | × | × |
| Qwen2.5-1.5B | world size 1,2,4,8 | world size 2,4,8 | √ | √ | × | √ | × | × | × | × | × | × | × | × | × | × | × | × | × |
| Qwen2.5-7B | world size 1,2,4,8 | world size 2,4,8 | √ | √ | × | √ | √ | × | × | √ | × | √ | × | × | √ | × | √ | × | × |
| Qwen2.5-14B | world size 2,4,8 | world size 2,4,8 | √ | √ | × | √ | √ | × | × | √ | × | √ | × | × | × | × | √ | × | × |
| Qwen2.5-32B | world size 4,8 | × | √ | √ | × | √ | √ | × | × | × | × | √ | × | × | × | × | √ | × | × |
| Qwen2.5-72B | world size 8 | × | √ | √ | × | √ | × | × | × | × | × | √ | × | × | × | √ | √ | × | √ |
| QwenCode2.5-7B | × | world size 2,4,8 | √ | × | × | √ | × | × | × | √ | × | √ | × | × | √ | × | × | × | × |
| QwenCode2.5-32B | world size 4,8 | × | × | √ | × | √ | × | × | × | × | × | × | × | × | × | × | × | × | × |
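Note: the world sizes listed in the table are configurations verified with the chat test; in practice you also need to account for the device memory consumed by the input sequence length.
- The tensor parallelism (Tensor Parallelism) degrees a model supports can be determined from the **number of KV heads** (`num_key_value_heads` or a similar field) in the model's `config.json`.
> For example, `Qwen2-0.5B` has `"num_key_value_heads"` set to 2, so it supports only world size 1,2
> For example, `Qwen2.5-32B` has `"num_key_value_heads"` set to 8, so in theory it supports world size 1,2,4,8 (ignoring memory usage)
- On the 800I A2, qwen2/2.5 series models support only the bfloat16 floating-point type; on the 300I DUO they support only float16.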
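The KV-head rule above can be sketched in a few lines. This is a minimal illustration, assuming the rule is "a power-of-two world size is a candidate when it evenly divides `num_key_value_heads`", which is inferred from the two examples rather than stated by MindIE; the function names are hypothetical, and the feature matrix remains the authoritative list of verified configurations.

```python
import json
from pathlib import Path


def candidate_world_sizes(num_key_value_heads: int, max_world_size: int = 8) -> list[int]:
    """Return the power-of-two world sizes that evenly divide the KV head count.

    Assumption: divisibility is the only constraint; device memory is ignored.
    """
    sizes = []
    size = 1
    while size <= max_world_size:
        if num_key_value_heads % size == 0:
            sizes.append(size)
        size *= 2
    return sizes


def world_sizes_from_config(weight_path: str) -> list[int]:
    """Read num_key_value_heads from <weight_path>/config.json and apply the rule."""
    config_file = Path(weight_path) / "config.json"
    config = json.loads(config_file.read_text(encoding="utf-8"))
    return candidate_world_sizes(config["num_key_value_heads"])
```

With 2 KV heads this yields world sizes 1 and 2, and with 8 KV heads it yields 1, 2, 4, and 8, matching the two examples above.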
## Open-source weights
#### Qwen2
- [Qwen2-7B](https://huggingface.co/Qwen/Qwen2-7B/tree/main)
- [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct/tree/main)
- [gte-Qwen2-7B-Instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)
- [Qwen2-57B-A14B-Instruct](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct/tree/main)
- [Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B/tree/main)
- [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct/tree/main)
#### Qwen2.5
- [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/tree/main)
- [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/tree/main)
- [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/tree/main)
- [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct/tree/main)
- [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct/tree/main)
- [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/tree/main)
#### Qwen2.5-Coder
- [Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct/tree/main)
- [Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct/tree/main)
# Version compatibility
The table below lists the transformers version required to run each Qwen model series
| Model version | transformers version |
| -------- | ---------------- |
| Qwen2 | 4.40.1 or later |
| Qwen2.5 | 4.43.1 |
| Qwen2.5-Coder | 4.43.1 or later |
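# Paged Attention inference guide
## Before you start
- The config.json in the Qwen model weight directory must contain the field `torch_dtype`, for example `"torch_dtype": "float16"`
- For quantized inference, the config.json in the quantized weight directory must contain the field `quantize`, whose value is the quantization method of those weights, for example `"quantize": "w8a8"` or `"quantize": "w8a16"`
- For long-sequence inference with QWen-14B in the [2k,32k] range ([8k,32k] for QWen-7B), set the environment variable `LONG_SEQ_ENABLE=1`. Long-sequence inference involves more compute nodes, so it is slower than short-sequence inference.
- `bf16` is recommended for Qwen2-7B, that is, keep the `torch_dtype` field in the weight directory's config.json as `bfloat16`
- The 300I DUO supports only `"torch_dtype": "float16"`
- Sparse quantization w8a8sc is supported only on the 300I DUO
- Sparse quantization has two steps. Step 1: the w8a8s weights can be generated on any machine; note that "torch_dtype" in the config must be changed to "float16". On an 800I A2 machine, multiple devices can be used to generate the w8a8s weights. On the 300I DUO, the w8a8s weights can only be generated on a single device or on the CPU. Step 2: the w8a8sc weights must be split on a 300I DUO.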
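The `torch_dtype` and `quantize` edits described above are plain JSON field updates. A minimal sketch, assuming the weight directory already contains a valid `config.json`; the helper name `update_config_fields` is hypothetical and not part of MindIE:

```python
import json
from pathlib import Path


def update_config_fields(weight_path: str, **fields) -> dict:
    """Add or overwrite top-level fields in <weight_path>/config.json."""
    config_file = Path(weight_path) / "config.json"
    config = json.loads(config_file.read_text(encoding="utf-8"))
    config.update(fields)
    # Write back with stable formatting; non-ASCII characters are kept as-is.
    config_file.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return config
```

For example, `update_config_fields(weight_path, torch_dtype="float16", quantize="w8a8")` prepares a W8A8 weight directory for the 300I DUO.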
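## Path variables
| Variable | Meaning |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| working_dir | Directory where the acceleration library and model repository are placed after download |
| llm_path | Path to the model repository. With the prebuilt package, the path is `${working_dir}/MindIE-LLM/`; with code downloaded from gitee, the path is `${working_dir}/MindIE-LLM/examples/atb_models` |
| script_path | Path to the scripts. The working scripts for the QWen model series are in `${llm_path}/examples/models/qwen` |
| weight_path | Path to the model weights |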
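How the variables in the table relate to each other can be written out directly. A small sketch; the function name `resolve_paths` and the `from_source` flag are illustrative assumptions, while the directory layout is the one given in the table:

```python
from pathlib import Path


def resolve_paths(working_dir: str, from_source: bool) -> dict[str, Path]:
    """Derive llm_path and script_path from working_dir.

    from_source=True models the gitee checkout layout; False models the prebuilt package.
    """
    base = Path(working_dir) / "MindIE-LLM"
    llm_path = base / "examples" / "atb_models" if from_source else base
    script_path = llm_path / "examples" / "models" / "qwen"
    return {"llm_path": llm_path, "script_path": script_path}
```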
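## Weight format conversion
Paged Attention requires weights in .safetensors format. If you do not have them, follow [this README](../../README.md) to convert
Note: the QWen model weights published on the Hugging Face site are already in .safetensors format
## Quantization
Quantized weights can be produced with msmodelslim (the Ascend compression and acceleration tool).
### Environment setup
For environment configuration, see [this README](../../../README.md)
- Set the environment variables
```shell
# Set the CANN environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
Install CANN (which already includes the msmodelslim tool), pytorch, and pytorch-npu,
along with the required Python libraries
```shell
pip install transformers # choose the transformers version according to the Qwen version; see "Version compatibility"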
pip install accelerate==0.27.2
pip install scipy==1.11.4
pip install tiktoken==0.5.2
pip install einops==0.7.0
pip install transformers_stream_generator==0.0.4
```
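### Exporting quantized weights
#### W8A8 quantization for qwen2-7b, qwen2.5-7b, qwen2.5-14b, and qwen2.5-32b
- Generate the W8A8 quantized weights with the commands below
- Distributed W8A8 quantization on NPUs is currently supported
- Download the msmodelslim quantization tool
- Download it from https://gitee.com/ascend/msit/tree/master/msmodelslim
- Follow the msmodelslim README to set it up
Note: after installing CANN, run `source set_env.sh` so that ASCEND_HOME_PATH is set; make sure it is non-empty before installing msmodelslim
- Run the quantization script
```shell
# Run "jq --version" to check whether jq is installed; if it returns "bash:jq:command not found", run "apt-get update" and then "apt install jq"
jq --version
cd ${llm_path}
# Select the logical NPUs to use on this machine by editing the "export ASCEND_RT_VISIBLE_DEVICES" value in convert_quant_weight.sh (device IDs and count)
# 7b models use a single device; 14b and 32b use 4 devices, e.g. ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
vi examples/models/qwen/convert_quant_weight.sh
# Generate the quantized weights
bash examples/models/qwen/convert_quant_weight.sh -src {float_weight_path} -dst {w8a8_quant_weight_path} -type qwen_w8a8
```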
#### Sparse quantization for qwen2-7b, qwen2.5-14b, and qwen2.5-7b
- Step 1
- Set the `torch_dtype` field in the model weights' config.json to `float16`
- Download the msmodelslim quantization tool
- Download it from https://gitee.com/ascend/msit/tree/master/msmodelslim
- Follow the msmodelslim README to set it up
Note: after installing CANN, run `source set_env.sh` so that ASCEND_HOME_PATH is set; make sure it is non-empty before installing msmodelslim
```shell
# Run "jq --version" to check whether jq is installed; if it returns "bash:jq:command not found", run "apt-get update" and then "apt install jq"
jq --version
# Set the CANN environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
cd ${llm_path}
# Select the logical NPUs to use on this machine by editing the "export ASCEND_RT_VISIBLE_DEVICES" value in convert_quant_weight.sh (device IDs and count)
# 7b models use a single device; 14b and 32b use 4 devices, e.g. ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
vi examples/models/qwen/convert_quant_weight.sh
bash examples/models/qwen/convert_quant_weight.sh -src {float_weight_path} -dst {w8a8s_quant_weight_path} -type qwen_w4a8
```
- Step 2: split and compress the quantized weights
```shell
export IGNORE_INFER_ERROR=1
torchrun --nproc_per_node {tp_size} -m examples.convert.model_slim.sparse_compressor --model_path {w8a8s_quant_weight_path} --save_directory {w8a8sc_quant_weight_path} --multiprocess_num 4
```
- `tp_size` is the tensor parallel degree
- Note: if the weights were split with TP=4 when generated, inference must also run with TP=4
- Example
```shell
torchrun --nproc_per_node 2 -m examples.convert.model_slim.sparse_compressor --model_path /data1/weights/model_slim/Qwen-14b_w8a8s --save_directory /data1/weights/model_slim/Qwen-14b_w8a8sc
```
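#### Qwen2-72B W8A16 quantization
- This assumes you are in the `${llm_path}` directory (the default install path is `/usr/local/Ascend/llm_model`)
- `quant_qwen2_w8a16_fast.py` under `examples/models/qwen/` is a preconfigured, well-performing quantization strategy for Qwen2-72B-W8A16. You can use it as is when exporting quantized weights, or change it to another strategy.
- Export the W8A16 quantized weights of Qwen2-72B with the `${llm_path}/examples/models/qwen/convert_quant_weight.sh` script (do not put the quantized weights in the same directory as the floating-point weights). The command is:
```shell
cd ${llm_path}
bash examples/models/qwen/convert_quant_weight.sh -src ${float_weight_path} -dst ${quant_weight_save_path} -type qwen_w8a16
```
Example:
```shell
bash examples/models/qwen/convert_quant_weight.sh -src /data1/models/Qwen2_72B -dst /data1/models/Qwen2_72B_W8A16 -type qwen_w8a16
```
- The export produces two files, `quant_model_weight_w8a16.safetensors` and `quant_model_description_w8a16.json`. The other files in the floating-point weight directory (everything except the safetensors files) must be copied manually into the target quantized directory.
- Add a "quantize" field to config.json in the quantized weight save path. For W8A16 quantization, the value of "quantize" is "w8a16".
- Add a "quantization" field to config.json in the quantized weight save path, with the value '{"group_size": 0}'
- A "group_size" of 0 means W8A16 uses per-channel quantization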
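The manual post-export steps above (copy the non-safetensors files, then add the two config fields) can be scripted. A minimal sketch under stated assumptions: the helper name `finalize_w8a16_dir` is hypothetical, `"quantization"` is written as a JSON object, and "everything except the safetensors files" is read literally, so any `*.json` index file in the floating-point directory is copied as well.

```python
import json
import shutil
from pathlib import Path


def finalize_w8a16_dir(float_dir: str, quant_dir: str) -> list[str]:
    """Copy non-safetensors files into quant_dir, then patch its config.json."""
    src, dst = Path(float_dir), Path(quant_dir)
    copied = []
    for item in sorted(src.iterdir()):
        # Skip sub-directories and the floating-point weight shards themselves.
        if item.is_file() and item.suffix != ".safetensors":
            shutil.copy2(item, dst / item.name)
            copied.append(item.name)

    config_file = dst / "config.json"
    config = json.loads(config_file.read_text(encoding="utf-8"))
    config["quantize"] = "w8a16"
    config["quantization"] = {"group_size": 0}  # 0 means per-channel quantization
    config_file.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return copied
```

The function returns the names of the copied files so the result can be checked before starting inference.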
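#### Qwen2-72B KV cache quantization
- This assumes you are in the `${llm_path}` directory (the default install path is `/usr/local/Ascend/llm_model`)
- Export the kv-int8 quantized weights with the following command:
```shell
bash examples/models/qwen/convert_quant_weight.sh -src ${float_weight_path} -dst ${quant_weight_save_path} -type qwen2_72b_w8a8c8 -device_type npu -use_devices 0,1,2,3,4,5,6,7
```
- Unlike Qwen2-72B W8A16 quantization, the quantization script already modifies config.json as required, so no further changes are needed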
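#### Qwen2-72B W8A8 quantization
- This assumes you are in the `${llm_path}` directory (the default install path is `/usr/local/Ascend/llm_model`)
- Export the W8A8 quantized weights with the following command:
- Step 1
- Download the msmodelslim quantization tool
- Download it from https://gitee.com/ascend/msit/tree/master/msmodelslim
- Follow the msmodelslim README to set it up
Note: after installing CANN, run `source set_env.sh` so that ASCEND_HOME_PATH is set; make sure it is non-empty before installing msmodelslim
```shell
bash examples/models/qwen/convert_quant_weight.sh -src ${float_weight_path} -dst ${quant_weight_save_path} -type qwen2_72b_w8a8 -device_type npu -use_devices 0,1,2,3,4,5,6,7
```
- Unlike Qwen2-72B W8A16 quantization, the quantization script already modifies config.json as required, so no further changes are needed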
#### Qwen2-72B W8A8 sparse quantization
- This assumes you are in the `${llm_path}` directory (the default install path is `/usr/local/Ascend/llm_model`)
- Export the W8A8 sparse quantized weights with the following commands:
- Step 1
- Set the `torch_dtype` field in the model weights' config.json to `float16`
- Download the msmodelslim quantization tool
- Download it from https://gitee.com/ascend/msit/tree/master/msmodelslim
- Follow the msmodelslim README to set it up
```shell
bash examples/models/qwen/convert_quant_weight.sh -src ${float_weight_path} -dst ${w8a8s_quant_weight_path} -type qwen2_72b_w8a8s -device_type npu -use_devices 0,1,2,3,4,5,6,7
```
- Step 2: split and compress the weights
```shell
export IGNORE_INFER_ERROR=1
torchrun --nproc_per_node {tp_size} -m examples.convert.model_slim.sparse_compressor --model_path {w8a8s_quant_weight_path} --save_directory {w8a8sc_quant_weight_path} --multiprocess_num 4
```
- `tp_size` is the tensor parallel degree
- Note: if the weights were split with TP=4 when generated, inference must also run with TP=4
- multiprocess_num must be set to 4 to reduce the load on the machine
- Example
```shell
torchrun --nproc_per_node 2 -m examples.convert.model_slim.sparse_compressor --model_path /data1/weights/model_slim/Qwen-14b_w8a8s --save_directory /data1/weights/model_slim/Qwen-14b_w8a8sc --multiprocess_num 4
```
- Unlike Qwen2-72B W8A16 quantization, the quantization script already modifies config.json as required; you only need to change the quant_type field in config.json to "w8a8s".
#### Qwen2.5-72B FA3 quantization
- Download the msmodelslim quantization tool
- Download it from https://gitee.com/ascend/msit/tree/master/msmodelslim
- Follow the msmodelslim README to set it up
- Generate the weights by following the README at the link below, or request them directly from the msModelSlim team:
https://gitee.com/ascend/msit/blob/master/msmodelslim/docs/FA%E9%87%8F%E5%8C%96%E4%BD%BF%E7%94%A8%E8%AF%B4%E6%98%8E.md
- The ModelSlim team provides two files: `quant_model_description_w8a8.json` and `quant_model_weight_w8a8.safetensors`.
- Export the FA3 quantized weights of Qwen2.5-72B with the `${llm_path}/examples/models/qwen/convert_quant_weight.sh` script (do not put the quantized weights in the same directory as the floating-point weights). The command is:
```shell
bash examples/models/qwen/convert_quant_weight.sh -src ${float_weight_path} -dst ${fa3_quant_weight_path} -type qwen2p5_fa3 -msmodelslim_path ${msmodelslim_path}
```
- `${msmodelslim_path}` is `<download directory>/msit/msmodelslim`; for example, if msmodelslim was downloaded under /home, the actual path is /home/msit/msmodelslim
- Example
```shell
bash examples/models/qwen/convert_quant_weight.sh -src /opt/models/Qwen2.5-72B-Instruct/ -dst /opt/models/Qwen2.5-72B-fa3 -type qwen2p5_fa3 -msmodelslim_path /home/msit/msmodelslim
```
- If a newer version of msmodelslim requires the `anti_calib_file` parameter, add `-fa3_use_anti_calib True` to the command above
- Example
```shell
bash examples/models/qwen/convert_quant_weight.sh -src /opt/models/Qwen2.5-72B-Instruct/ -dst /opt/models/Qwen2.5-72B-fa3 -type qwen2p5_fa3 -msmodelslim_path /home/msit/msmodelslim -fa3_use_anti_calib True
```
- The other files in the floating-point weight directory (everything except the safetensors files) must be copied manually into the target quantized directory.
- After copying, manually add the following two fields to `config.json`:
```json
"quantize": "w8a8",
"quantization_config": {"fa_quant_type": "FAQuant"}
```
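#### Qwen2.5-72B W4A16 quantization
- Generate the W4A16 quantized weights with the commands below
- Distributed quantization on NPUs is currently supported
- Download the msmodelslim quantization tool
- Download it from https://gitee.com/ascend/msit/tree/master/msmodelslim
- Follow the msmodelslim README to set it up
Note: after installing CANN, run `source set_env.sh` so that ASCEND_HOME_PATH is set; make sure it is non-empty before installing msmodelslim
- Run the quantization script
```shell
# Run "jq --version" to check whether jq is installed; if it returns "bash:jq:command not found", run "apt-get update" and then "apt install jq"
jq --version
cd ${llm_path}
# Select the logical NPUs to use on this machine by editing the "export ASCEND_RT_VISIBLE_DEVICES" value in convert_quant_weight.sh (device IDs and count)
# 72B uses 8 devices, e.g. ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
vi examples/models/qwen/convert_quant_weight.sh
# Generate the quantized weights
bash examples/models/qwen/convert_quant_weight.sh -src {float_weight_path} -dst {w4a16_quant_weight_path} -type qwen_w4a16
```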
#### Qwen2.5-Coder-7B sparse quantization
- Step 1
- Set the `torch_dtype` field in the model weights' config.json to `float16`
- Download the msmodelslim quantization tool
- Download it from https://gitee.com/ascend/msit/tree/master/msmodelslim
- Follow the msmodelslim README to set it up
```shell
# Run "jq --version" to check whether jq is installed; if it returns "bash:jq:command not found", run "apt-get update" and then "apt install jq"
jq --version
# Set the CANN environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
cd ${llm_path}
# Qwen2.5-Coder-7B is loaded on the CPU to generate the quantized weights: comment out the "export ASCEND_RT_VISIBLE_DEVICES" line in convert_quant_weight.sh
vi examples/models/qwen/convert_quant_weight.sh
bash examples/models/qwen/convert_quant_weight.sh -src {float_weight_path} -dst {w8a8s_quant_weight_path} -type qwencode_w8a8s -device_type cpu
```
After sparse quantization, the "quantize" type is w8a8s
- Step 2: split and compress the quantized weights
```shell
export IGNORE_INFER_ERROR=1
torchrun --nproc_per_node {tp_size} -m examples.convert.model_slim.sparse_compressor --model_path {w8a8s_quant_weight_path} --save_directory {w8a8sc_quant_weight_path}
```
- `tp_size` is the tensor parallel degree
- Note: if the weights were split with TP=4 when generated, inference must also run with TP=4
- Example
```shell
torchrun --nproc_per_node 4 -m examples.convert.model_slim.sparse_compressor --model_path /data/Qwen2.5-Coder-7B-w8a8s --save_directory /data/Qwen2.5-Coder-7B-w8a8sc
```
#### Qwen2.5-14B and Qwen2.5-72B pdmix W8A8C8 quantization
- Export the pdmix W8A8C8 quantized weights of Qwen2.5-14B and Qwen2.5-72B with the `${llm_path}/examples/models/qwen/convert_quant_weight.sh` script (do not put the quantized weights in the same directory as the floating-point weights). The command is:
- This assumes you are in the `${llm_path}` directory (the default install path is `/usr/local/Ascend/atb-models`). `trust_remote_code` is an optional parameter that indicates whether to trust local executable files; passing it means they are trusted
```shell
bash examples/models/qwen/convert_quant_weight.sh -src ${float_weight_path} -dst ${quant_weight_save_path} -type ${type} -device_type npu -use_devices 0,1,2,3,4,5,6,7 -msmodelslim_path ${msmodelslim_path} -trust_remote_code
```
- `${msmodelslim_path}` is `<download directory>/msit/msmodelslim`; for example, if msmodelslim was downloaded under /home, the actual path is /home/msit/msmodelslim
Qwen2.5-72B example:
```shell
bash examples/models/qwen/convert_quant_weight.sh -src /data/Qwen2.5-72B-Instruct/ -dst /data/qwen2.5-72B-pdmix-w8a8c8/ -type qwen2p5_72b_w8a8c8_pdmix -device_type npu -use_devices 0,1,2,3,4,5,6,7 -msmodelslim_path /home/msit/msmodelslim -trust_remote_code
```
Qwen2.5-14B example:
```shell
bash examples/models/qwen/convert_quant_weight.sh -src /data/Qwen2.5-14B-Instruct/ -dst /data/qwen2.5-14B-pdmix-w8a8c8/ -type qwen2p5_14b_w8a8c8_pdmix -device_type npu -use_devices 0,1,2,3,4,5,6,7 -msmodelslim_path /home/msit/msmodelslim -trust_remote_code
```
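## Inference
### Chat test
The quantized weight directory may lack some required files (this depends on the CANN version used when converting the weights). If quantized inference fails to start, copy config.json and the other related files into the quantized weight path with the following commands:
```shell
cp ${float_weight_path}/*.py ${quant_weight_path}
cp ${float_weight_path}/*.json ${quant_weight_path}
cp ${float_weight_path}/*.tiktoken ${quant_weight_path}
```
Before starting quantized inference, add (or update) the `torch_dtype` field in the weight directory's config.json, for example `"torch_dtype": "float16"`.
Before starting quantized inference, add (or update) the `quantize` field in the weight directory's config.json with the matching quantization method, for example `"quantize": "w8a8"` or `"quantize": "w8a16"`
Run the following command in the `${llm_path}` directory
```shell
bash examples/models/qwen/run_pa.sh -m ${weight_path} --trust_remote_code true
```
Notes:
1. Inference supports both floating-point and quantized weights: pass the floating-point weight path as `${weight_path}` for floating-point inference, or the quantized weight path for quantized inference
2. --trust_remote_code is an optional parameter that indicates whether to trust local executable files; the default is false. Passing true means they are trusted; -r is its short form
3. Both Qwen and Qwen1.5 models are supported: pass the Qwen weight path as `${weight_path}` to run Qwen, or the Qwen1.5 weight path to run Qwen1.5
4. Qwen chat models need chat mode enabled to produce correct output.
Run:
```shell
bash examples/models/qwen/run_pa.sh -m ${weight_path} --trust_remote_code true -c true
```
5. For embedding models such as gte-Qwen2-7B-Instruct, run:
```shell
bash examples/models/qwen/run_pa.sh -m ${weight_path} -e true
```
6. Running qwen requires the third-party dependency tiktoken; if it is missing from the environment, install it with:
```shell
pip install tiktoken
```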
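**Running Multi-Lora**
- Download the Lora weights: they must contain at least one safetensors file and a configuration file named `adapter_config.json`
- In the base model's weight folder, add a `lora_adapter.json` file that lists the Lora weights to preload, for example:
```json
{
"qwen_lora1": "/home/data/lora/Qwen1.5-14b-chat/adapter1",
"qwen_lora2": "/home/data/lora/Qwen1.5-14b-chat/adapter2"
}
```
- At inference time, specify the adapter weights used by each request; by default only the base model weights are used
- Example run
```shell
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --nproc_per_node 8 --master_port 20030 -m examples.run_pa --model_path /data1/models/qwen2/Qwen1.5-14b-chat --is_chat_model --max_output_length 256 --max_batch_size 2 --input_dict '[{"prompt": "What is deep learning?", "adapter": "qwen_lora1"}, {"prompt": "What is deep learning?"}]'
```
- Constraints and limitations
- Supported only on the Atlas 800I A2
- Lora weights cannot be hot-loaded; if no `adapter_id` is found, `base` is used by default
- Only floating-point models are supported
- The keys in `lora_adapter.json` are the values of the `adapter` key in the `input_dict` parameter, also called `adapter_id`.
- Each `adapter_id` must be unique and must not be the string `base`
- At most 10 Lora weights can be loaded, provided there is enough device memory
- **The `lora_data.jsonl` file used for accuracy testing must contain more `adapter_id` entries than the dataset has samples; otherwise the surplus samples fall back to base**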
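The constraints above can be checked before launching inference. A minimal sketch, assuming `lora_adapter.json` is a flat JSON object that maps `adapter_id` to a directory; the function names are hypothetical and are not part of MindIE. Duplicate keys are rejected at parse time because a plain `json.loads` would silently keep only the last one.

```python
import json
from pathlib import Path

MAX_ADAPTERS = 10      # documented upper bound, memory permitting
RESERVED_ID = "base"   # reserved name for the base model weights


def parse_adapter_map(text: str) -> dict:
    """Parse lora_adapter.json content, raising ValueError on duplicate adapter_ids."""
    def reject_duplicates(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen:
                raise ValueError(f"duplicate adapter_id: {key}")
            seen.add(key)
        return dict(pairs)

    return json.loads(text, object_pairs_hook=reject_duplicates)


def check_adapter_ids(adapters: dict) -> list[str]:
    """Return human-readable problems with the adapter_id set (empty list if none)."""
    problems = []
    if RESERVED_ID in adapters:
        problems.append(f"adapter_id must not be '{RESERVED_ID}'")
    if len(adapters) > MAX_ADAPTERS:
        problems.append(f"{len(adapters)} adapters listed, at most {MAX_ADAPTERS} can be loaded")
    return problems


def check_adapter_dir(path: str) -> list[str]:
    """Check that a Lora directory has adapter_config.json and a safetensors file."""
    directory = Path(path)
    problems = []
    if not (directory / "adapter_config.json").is_file():
        problems.append(f"{path}: missing adapter_config.json")
    if not any(directory.glob("*.safetensors")):
        problems.append(f"{path}: no .safetensors file found")
    return problems
```

A caller would parse the file, run `check_adapter_ids` once, and run `check_adapter_dir` for each listed path.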
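### run_pa.sh parameters (edit them in the script)
Depending on the hardware, adjust run_pa.sh according to the table below before running it
| Parameter | Meaning | Recommended for 800I A2 | Recommended for 300I DUO |
| ------------------------- | ----------------------------------------- | ---------------- | ---------------- |
| BIND_CPU | Switch for binding CPU cores; binding is enabled by default | 1 | 1 |
| ASCEND_RT_VISIBLE_DEVICES | IDs of the devices to use, separated by commas | Set as needed | Set as needed |
| RESERVED_MEMORY_GB | Reserved memory, typically the memory needed by the acceleration library plus communication memory | 3 | 3 |
| MASTER_PORT | Port for inter-device communication; usually left unchanged unless there is a conflict | | |
Note: parallel runs on an odd number of devices are not supported yet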
## Accuracy testing
- See [this README](../../../tests/modeltest/README.md)
Examples:
```shell
bash run.sh pa_fp16 full_BoolQ 1 qwen /data1/models/qwen2/qwen_quant_test/ 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen-7b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen-14b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen-72b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen1.5-14b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen-14b-chat_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen-72b-chat_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen1.5-0.5b-chat_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen1.5-4b-chat_weight_path} 4
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen1.5-7b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen1.5-14b-chat_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen1.5-32b-chat_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen1.5-72b_weight_path} 8
bash run.sh pa_fp16 full_BoolQ 1 qwen ${Qwen1.5-MoE-A2.7B-Chat_weight_path} 8
bash run.sh pa_fp16 full_HumanEval_X 1 qwen ${Qwen2.5-Coder-7B_weight_path} 8
```
- gte_qwen test
- Install the dependencies
C_MTEB, optimum, tqdm, datasets, faiss-cpu
C_MTEB depends on pytrec-eval. Installing it fetches https://github.com/usnistgov/trec_eval/archive/v9.0.8.tar.gz, which can fail with an SSL certificate error
Workaround
Download pytrec_eval manually and upload it to the server, or run the following on the server
```shell
wget https://files.pythonhosted.org/packages/2e/03/e6e84df6a7c1265579ab26bbe30ff7f8c22745aa77e0799bba471c0a3a19/pytrec_eval-0.5.tar.gz
tar -xzvf pytrec_eval-0.5.tar.gz
cd pytrec_eval-0.5
```
Edit pytrec_eval-0.5/setup.py and add the following at the top
```python
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
```
Install (run from the directory that contains pytrec_eval-0.5)
```shell
tar -zcvf pytrec_eval-0.5.tar.gz pytrec_eval-0.5
pip install pytrec_eval-0.5.tar.gz
```
Install the dependency pytrec-eval-terrier the same way
```shell
wget http://mirrors.aliyun.com/pypi/packages/dc/61/9003ffdb64f74a91208d69235dbcd380ae1a8d267089348eb8f7aab9819a/pytrec_eval_terrier-0.5.7.tar.gz
tar -xzvf pytrec_eval_terrier-0.5.7.tar.gz
cd pytrec_eval_terrier-0.5.7
```
Edit pytrec_eval_terrier-0.5.7/setup.py and add the following at the top
```python
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
```
Install (run from the directory that contains pytrec_eval_terrier-0.5.7)
```shell
tar -zcvf pytrec_eval_terrier-0.5.7.tar.gz pytrec_eval_terrier-0.5.7
pip install pytrec_eval_terrier-0.5.7.tar.gz
```
- Get the test dataset
```shell
mkdir dataset
```
Download the dataset files [corpus, queries](https://huggingface.co/datasets/C-MTEB/T2Retrieval/tree/main/data) and [dev](https://huggingface.co/datasets/C-MTEB/T2Retrieval-qrels/tree/main/data) into the `dataset` directory
- Change where the embedding output is stored
```shell
vim ../../../atb_llm/models/qwen2/flash_causal_qwen2_gte.py
```
Change line 233 to logits_name = f"embedding_tensor_0"
- Run
```shell
python eval_t2retrieval_gte_npu/gpu.py --model_type_or_path model_type_or_path --batch_size batch_size --device device
```
Results are saved under results/ in the current directory
## Performance testing
- Go to the following path
```shell
${llm_path}/tests/modeltest
```
- Run
```shell
bash run.sh pa_fp16 [performance|full_CEval|full_BoolQ] ([case_pair]) [batch_size] qwen [weight_dir] [chip_num] ([max_position_embedding/max_sequence_length])
```
- Environment variables
1. HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0
These two affect performance: enabling them makes computation deterministic but slower, while disabling them is faster, so both are turned off here.
2. HCCL_BUFFSIZE=120
This affects the device memory used by hccl. It needs to be set and has almost no effect on performance.
3. ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
This is a device-memory optimization and should be enabled; for small batches and short sequences it is better left off.
Examples:
```shell
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen-7b_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen-14b_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen-72b_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen1.5-14b_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen-14b-chat_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen-72b-chat_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen1.5-0.5b-chat_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen1.5-4b-chat_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen1.5-7b_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen1.5-14b-chat_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen1.5-32b-chat_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen1.5-72b_weight_path} 8
HCCL_DETERMINISTIC=false LCCL_DETERMINISTIC=0 HCCL_BUFFSIZE=120 ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1 bash run.sh pa_fp16 performance [[2048,2048],[1024,1024],[512,512],[256,256]] 1 qwen ${Qwen1.5-MoE-A2.7B-Chat_weight_path} 8
```
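When many models are benchmarked, it can help to build the `run.sh` invocation programmatically instead of repeating the long one-liners. A minimal sketch, assuming the positional order shown in the usage line above; `build_run_command` and `build_env` are hypothetical helper names, and the extra `use_refactor` argument that `run.sh` expects for the `llama` model is deliberately not handled.

```python
import os

# Environment variables from the performance examples above.
PERF_ENV = {
    "HCCL_DETERMINISTIC": "false",
    "LCCL_DETERMINISTIC": "0",
    "HCCL_BUFFSIZE": "120",
    "ATB_WORKSPACE_MEM_ALLOC_GLOBAL": "1",
}


def build_run_command(model_type, test_mode, batch_size, model_name,
                      weight_dir, chip_num, case_pair=None, max_seq_len=None):
    """Assemble the run.sh argument list in the order the script reads it."""
    args = ["bash", "run.sh", model_type, test_mode]
    if test_mode == "performance":
        # case_pair is only passed (and is required) in performance mode.
        if case_pair is None:
            raise ValueError("performance mode requires case_pair")
        args.append(case_pair)
    args += [str(batch_size), model_name, weight_dir, str(chip_num)]
    if max_seq_len is not None:
        args.append(str(max_seq_len))
    return args


def build_env(base=None):
    """Return a copy of the environment with the performance settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(PERF_ENV)
    return env
```

The resulting list and environment can be passed to `subprocess.run(cmd, env=env, cwd=modeltest_dir, check=True)`, where `modeltest_dir` is `${llm_path}/tests/modeltest`.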
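- See [this README](../../../tests/modeltest/README.md)
## prefix_cache
- See [this README](../../../../../mindie_llm/text_generator/plugins/prefix_cache)
This feature currently supports only qwen2-72b fp16
# Flash Attention inference guide
Path variables, weight conversion, and so on are the same as for Paged Attention.
## Inference
### Chat test
Run the following command in the `${llm_path}` directory
```shell
bash examples/models/qwen/run_fa.sh -m ${weight_path}
```
# Running Qwen models in a virtual machine (covers the Qwen, Qwen1.5, Qwen2, and Qwen2.5 series)
If you run a Qwen model inside a virtual machine and the physical host supports HCCS communication, set the following environment variable:
```shell
export NPU_VM_SUPPORT_HCCS=1
```
# Qwen long-sequence inference
Long-sequence inference with qwen requires setting both use_dynamic_ntk and use_logn_attn to true in the qwen weights' config.json. For example, for Qwen-14B-Chat:
Note: if you do not use long-sequence inference, set both use_dynamic_ntk and use_logn_attn to false.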
```json
{
"architectures": [
"QwenLMHeadModel"
],
// ...
"use_dynamic_ntk": true,
// ...
"use_logn_attn": true,
// ...
}
```
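Because the two flags must always be switched together, a quick consistency check on `config.json` can catch a half-edited file. A minimal sketch; the function name and the `"long"`/`"short"` labels are illustrative, and treating a missing flag as `False` is an assumption:

```python
def long_sequence_mode(config: dict) -> str:
    """Return "long" or "short"; raise ValueError if the two flags disagree."""
    use_ntk = bool(config.get("use_dynamic_ntk", False))
    use_logn = bool(config.get("use_logn_attn", False))
    if use_ntk != use_logn:
        raise ValueError(
            "use_dynamic_ntk and use_logn_attn must be enabled or disabled together"
        )
    return "long" if use_ntk else "short"
```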
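Long-sequence inference with qwen2 requires adding a rope_scaling parameter to the qwen2 weights' config.json. For example, for Qwen2-72B-Instruct:
Note: do not add it if you do not use long-sequence inference.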
```json
{
"architectures": [
"Qwen2ForCausalLM"
],
// ...
"vocab_size": 152064,
// adding the following snippets
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
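}
```
Notes:
- Apart from the launch command, all other steps are the same as for PA
- qwen and qwen1.5 do not support the bf16 format yet; set the `torch_dtype` field in the weight directory's config.json to `float16`
- Chat mode is not supported yet. Some chat models, such as qwen1.5-32b-chat, may produce abnormal output; if that happens, prefer PA
- Long-sequence inference involves more compute nodes, so it is slower than short-sequence inference.
- Some qwen1.5 chat models (4B, 32B) do not support chat inference with fa yet; prefer pa. If you need fa, rewrite the input as a continuation prompt, for example turn `What's deep learning?` into `Deep learning is`
- Qwen2 and Qwen2.5 series models currently use bf16 on the 800I A2 and fp16 on the 300I DUO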
================================================
FILE: llm-localization/ascend/mindie/README.md
================================================
# Entry point
- https://www.hiascend.com/document/detail/zh/mindie/20RC2/index/index.html
# Supported models
- https://www.hiascend.com/software/mindie/modellist
- https://www.hiascend.com/document/detail/zh/mindie/10RC1/description/whatismindie/mindie_what_0000.html
```
docker run -it -u root --name=mindie_server_t35 --net=host --ipc=host \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /home:/workspace \
mindie_server:1.0.T35 \
/bin/bash
docker exec -it mindie_server_t35 bash
/home/HwHiAiUser/mindie-service_1.0.RC1_linux-aarch64/bin
cp -r /workspace/token_input_gsm.csv .
vim conf/config.json
cd /workspace/aicc/model_from_hf/chatglm3-6b-chat
/workspace/aicc/model_from_hf/Baichuan2-7B-Chat
---
/home/HwHiAiUser/atb-models/examples/convert
convert_weights.py
Use ${llm_path}/examples/convert/convert_weights.py to convert bin weights to safetensors format
Example
python ${llm_path}/examples/convert/convert_weights.py --model_path ${weight_path}
The output is saved in the same directory as the bin weights
/home/HwHiAiUser
source set_env.sh
python examples/convert/convert_weights.py --model_path /workspace/aicc/model_from_hf/Baichuan2-7B-Chat --from_pretrained False
python examples/convert/convert_weights.py --model_path /workspace/aicc/model_from_hf/Baichuan2-7B-Chat
---
Launch scripts
The Flash Attention launch script is at ${llm_path}/examples/run_fa.py
The Paged Attention launch script is at ${llm_path}/examples/run_pa.py
```
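## Image
- https://ascendhub.huawei.com/#/detail/mindie
```
# Obtain login access: enter the "image download credential" you have set. If none is set, or the credential is older than 24 hours and has expired, open the drop-down under your login user name to set one
docker login -u 157xxxx4031 ascendhub.huawei.com
# Pull the image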
docker pull ascendhub.huawei.com/public-ascendhub/mindie:1.0.RC1-800I-A2-aarch64
```
## Migration
```
docker save -o mindie-1.0.tar ascendhub.huawei.com/public-ascendhub/mindie:1.0.RC1-800I-A2-aarch64
scp root@192.xxx.16.211:/root/mindie-1.0.tar .
# Resumable transfer
rsync -P --rsh=ssh -r root@192.xxx.16.211:/root/mindie-1.0.tar .
```
## Performance testing
```
nohup python performance-stream-baichuan2.py > baichuan2.log 2>&1 &
```
================================================
FILE: llm-localization/ascend/mindie/config/chatglm3-6b.json
================================================
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 4
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "0.0.0.0",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[0,1]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "chatglm3-6b",
"modelWeightPath" : "/home/aicc/model_from_hf/chatglm3-6b-chat-full",
"worldSize" : 2,
"cpuMemSize" : 5,
"npuMemSize" : 16,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 200,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
================================================
FILE: llm-localization/ascend/mindie/config/qwen-72b.json
================================================
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 4
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "127.0.0.1",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[0,1,2,3,4,5,6,7]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "qwen-72b",
"modelWeightPath" : "/home/aicc/model_from_hf/qwen-72b-chat-hf",
"worldSize" : 8,
"cpuMemSize" : 5,
"npuMemSize" : 10,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 200,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
================================================
FILE: llm-localization/ascend/mindie/config/run.sh
================================================
#!/bin/bash
# Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# shellcheck disable=SC2148
# Quote the paths so the script also works from a directory whose name contains spaces
SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)
TESTS_DIR=$(cd "$SCRIPT_DIR/core/" && pwd)
test_mode="performance"
model_type="pa"
model_name=""
weight_dir=""
data_type="fp16"
hardware_type="NPU"
chip_num=0
dataset="CEval"
batch_size=0
case_pair="[]"
use_refactor="True"
max_position_embedding=-1
function fn_prepare()
{
if [ "$hardware_type" == "NPU" ]; then
if [ -z "$ASCEND_HOME_PATH" ];then
echo "env ASCEND_HOME_PATH not exists, fail"
exit 1
fi
if [ -z "$ATB_HOME_PATH" ];then
echo "env ATB_HOME_PATH not exists, fail"
exit 1
fi
fi
export INT8_FORMAT_NZ_ENABLE=1
export PYTHONPATH="${PYTHONPATH}:$(dirname "$(readlink -f "$0")")"
export PYTHONPATH="${PYTHONPATH}:$(dirname "$(dirname "$(dirname "$(readlink -f "$0")")")")"
IFS="_"
read -ra parts <<< "$1"
model_type="${parts[0]}"
if [ "$model_type" == "pa" ]; then
data_type="${parts[1]}"
fi
test_mode="$2"
if ! [ "$test_mode" == "performance" ]; then
read -ra parts <<< "$2"
test_mode="${parts[0]}"
dataset="${parts[1]}"
fi
if [ "$test_mode" == "performance" ]; then
export ATB_LLM_BENCHMARK_ENABLE=1
export ATB_LLM_BENCHMARK_FILEPATH="${SCRIPT_DIR}/benchmark.csv"
fi
}
function fn_run_single()
{
test_file="${model_name}_test.py"
test_path="${TESTS_DIR}/${test_file}"
if [[ ! -e "$test_path" ]];then
echo "model test file $test_path is not found."
exit 1
fi
if [ "$chip_num" == 0 ]; then
code_line=$(grep -A 1 "def get_chip_num(self):" "${test_path}" | tail -n 1)
if [ -z "$code_line" ]; then
echo "Warning: get_chip_num() not overwrite in '$test_file', use chip_num 1"
chip_num=1
else
chip_num=$(echo "$code_line" | awk -F 'return ' '{print $2}')
if ! [[ "$chip_num" =~ ^[1-9][0-9]*$ ]]; then
echo "Error: return value of get_chip_num() in '$test_file' is not a digit."
exit 1
fi
fi
fi
if [ "$hardware_type" == "NPU" ]; then
if ! [ -n "$ASCEND_RT_VISIBLE_DEVICES" ]; then
devices=""
for ((i=0; i /dev/null; then
hardware_type="GPU"
echo "INFO: Detected NVIDIA GPU"
else
if command -v npu-smi &> /dev/null; then
echo "INFO: Detected Ascend NPU"
else
echo "Error: No GPU or NPU detected"
exit 1
fi
fi
if [ $# -eq 0 ]; then
echo "Error: require parameter. Please refer to README."
exit 1
fi
model_type=$1
case "$model_type" in
fa|pa_fp16|pa_bf16)
echo "INFO: current model_type: $model_type"
;;
*)
echo "ERROR: invalid model_type, only support fa, pa_fp16, pa_bf16"
exit 1
;;
esac
test_modes=$2
case "$test_modes" in
performance|simplified_GSM8K|simplified_TruthfulQA|full_CEval|full_GSM8K|full_MMLU|full_TruthfulQA|full_BoolQ|full_HumanEval)
echo "INFO: current test_mode: $test_modes"
;;
*)
echo "ERROR: invalid test_mode, only support performance, simplified_GSM8K, simplified_TruthfulQA, \
full_CEval, full_GSM8K, full_MMLU, full_TruthfulQA, full_BoolQ, full_HumanEval"
exit 1
;;
esac
if [ "$test_modes" == "performance" ]; then
case_pair=$3
shift
fi
batch_size=$3
model_name=$4
if [ "$model_name" == "llama" ]; then
use_refactor=$5
shift
fi
weight_dir=$5
echo "INFO: current batch_size: $batch_size"
echo "INFO: current model_name: $model_name"
echo "INFO: current weight_dir: $weight_dir"
fn_prepare "$model_type" "$test_modes"
if ! [[ "$6" =~ ^[1-9][0-9]*$ ]]; then
echo "Error: input chip_num is not a positive integer."
exit 1
fi
chip_num=$6
echo "INFO: use input chip_num $chip_num"
if [ $# -ge 7 ]; then
if ! [[ "$7" =~ ^[0-9]+$ ]]; then
echo "Error: input max_position_embedding or max_seq_len is not a digit."
exit 1
fi
max_position_embedding=$7
echo "INFO: use input max_position_embedding or max_seq_len $max_position_embedding"
fi
fn_run_single
}
fn_main "$@"
================================================
FILE: llm-localization/ascend/mindie/config-1.0.RC1.json
================================================
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 8
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "0.0.0.0",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[$npuids]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "$model_name",
"modelWeightPath" : "$model_weight_path",
"worldSize" : $world_size,
"cpuMemSize" : 5,
"npuMemSize" : $npu_mem_size,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 192,
"maxPrefillTokens" : 12000,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 256,
"maxIterTimes" : 1024,
"maxPreemptCount" : 200,
"supportSelectBatch" : true,
"maxQueueDelayMicroseconds" : 5000
}
}
================================================
FILE: llm-localization/ascend/mindie/docker/README.md
================================================
## Enter the container
```
# base image: ascendhub.huawei.com/public-ascendhub/mindie:1.0.RC1-800I-A2-aarch64
# commit
docker commit -a "guodong" -m "mindie-service" b7fe01c81fcc ascendhub.huawei.com/public-ascendhub/mindie-service-env:1.0.RC1-800I-A2-aarch64
docker run -it --rm --net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--entrypoint=bash \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
ascendhub.huawei.com/public-ascendhub/mindie-service-env:1.0.RC1-800I-A2-aarch64
docker commit -a "guodong" -m "mindie-service" 45bafed49c5b ascendhub.huawei.com/public-ascendhub/mindie-service-env:v2
docker save -o mindie-service-env.tar ascendhub.huawei.com/public-ascendhub/mindie-service-env:v2
```
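The `--device` flags in the `docker run` command above select NPUs 4 to 7 plus three management devices. A minimal sketch for generating those flags for an arbitrary set of NPU IDs; the helper name `device_flags` is hypothetical and not part of this repo:

```bash
#!/bin/bash
# device_flags ID...: print one --device flag per NPU id, followed by the
# three management devices used by the docker run command above.
device_flags() {
    local id
    for id in "$@"; do
        printf -- '--device=/dev/davinci%s ' "$id"
    done
    printf -- '--device=/dev/davinci_manager --device=/dev/hisi_hdc --device=/dev/devmm_svm\n'
}

device_flags 4 5 6 7
```

The output can be pasted into a `docker run` command line in place of the hand-written `--device` lines.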
## Start the service
```
docker run -it --rm --net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/qwen1.5-14b.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
pkill -9 mindieservice_d
docker save -o mindie.tar ascendhub.huawei.com/public-ascendhub/mindie:1.0.RC1-800I-A2-aarch64
docker save -o mindie-service-online.tar ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.0
docker save -o mindie-service-online-v1.1.tar ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
docker run -it --rm --net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/qwen-72b.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
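The service start commands in this README differ only in the config file mounted over `config.json`. A sketch of a wrapper that assembles the shared options and prints the resulting command for review instead of running it; the helper name `mindie_run_cmd` and the variables `IMAGE` and `CONF_TARGET` are assumptions made for illustration:

```bash
#!/bin/bash
# Hypothetical helper (not part of the repo): print the docker run command
# for a given config file so it can be reviewed before being executed.
IMAGE="ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1"
CONF_TARGET="/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json"

mindie_run_cmd() {
    local conf="$1"
    local args=(
        docker run -it --rm --net=host --ipc=host
        --shm-size=50g
        --privileged=true
        -w /home
        --device=/dev/davinci_manager
        --device=/dev/hisi_hdc
        --device=/dev/devmm_svm
        -v /usr/local/Ascend/driver:/usr/local/Ascend/driver
        -v /usr/local/dcmi:/usr/local/dcmi
        -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
        -v /usr/local/sbin:/usr/local/sbin
        -v /home:/home
        -v /tmp:/tmp
        -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime
        -v "${conf}:${CONF_TARGET}"
        "$IMAGE"
    )
    # Print each argument shell-quoted on one line; nothing is executed here.
    printf '%q ' "${args[@]}"
    printf '\n'
}

mindie_run_cmd /home/aicc/docker/qwen1.5-14b.json
```

Keeping the options in an array avoids the trailing-backslash continuation lines and makes it harder to drop a mount by accident when copying the command for a new model.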
### qwen1.5
```
docker run -it --rm --net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/qwen1.5-14b.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
docker run -it --rm --net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/qwen1.5-14b-2tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
docker run -it --rm --net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/qwen1.5-72b.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
```
docker run -it --rm --net=host --ipc=host \
-e ASCEND_VISIBLE_DEVICES=0,1,2,3 \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/qwen1.5-7b-4tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
```
# ASCEND_RT_VISIBLE_DEVICES: sets the Device IDs that the application process may use. One or more Device IDs can be specified at once.
docker run -it --rm --net=host --ipc=host \
-e ASCEND_RT_VISIBLE_DEVICES=1 \
--shm-size=50g \
--privileged=true \
-w /home \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /home:/home \
-v /tmp:/tmp \
-v /home/aicc/docker/qwen1.5-7b-1tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
docker run -it --rm --net=host --ipc=host \
-e ASCEND_RT_VISIBLE_DEVICES=6,7 \
--shm-size=50g \
--privileged=true \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-w /home \
-v /home:/home \
-v /tmp:/tmp \
-v /home/aicc/docker/qwen1.5-7b-2tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
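`ASCEND_RT_VISIBLE_DEVICES` takes a comma-separated list of Device IDs, as in the `1` and `6,7` values above. A small sketch for generating that list for a run of consecutive devices; the function name `device_list` is hypothetical:

```bash
#!/bin/bash
# device_list START COUNT: print COUNT consecutive ids starting at START,
# joined with commas.
device_list() {
    local start="$1" count="$2" ids=() i
    for ((i = 0; i < count; i++)); do
        ids+=("$((start + i))")
    done
    # A function-local IFS makes "${ids[*]}" join the elements with ",".
    local IFS=","
    echo "${ids[*]}"
}

device_list 6 2   # prints 6,7
```

The result can be passed straight to `docker run`, for example `-e ASCEND_RT_VISIBLE_DEVICES="$(device_list 6 2)"`.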
```
nohup python performance.py > qwen1.5-7b-2tp.log 2>&1 &
nohup python performance.py > qwen1.5-7b-4tp.log 2>&1 &
nohup python performance.py > qwen1.5-7b-1tp.log 2>&1 &
nohup python performance-qwen1.5.py > qwen1.5-7b-1tp.log 2>&1 &
```
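If the benchmark scripts send requests to the running service (an assumption; the scripts themselves are not shown here), it can help to wait until the service port accepts connections before launching them. The configs in this directory listen on port 1025. A sketch using bash's `/dev/tcp` pseudo-device, with a hypothetical function name `wait_for_port`:

```bash
#!/bin/bash
# wait_for_port HOST PORT RETRIES: try a TCP connect once per second, up to
# RETRIES times. Returns 0 on the first success, 1 if every attempt fails.
wait_for_port() {
    local host="$1" port="$2" retries="$3" i
    for ((i = 0; i < retries; i++)); do
        # /dev/tcp is a bash feature; the subshell closes the socket on exit.
        if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Example:
# wait_for_port 127.0.0.1 1025 60 && nohup python performance.py > qwen1.5-7b-2tp.log 2>&1 &
```

A successful connect only shows that the port is open, not that the model has finished loading, so a generous retry count is still advisable.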
### qwen1
```
docker run -it --rm --net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/qwen-72b.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
### baichuan2
```
docker run -it --rm \
--net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/baichuan2-7b.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
```
docker run -it --rm \
--net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/baichuan2-7b-4tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
```
docker run -it --rm \
--net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/baichuan2-7b-1tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
### baichuan2-13b
```
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/llm_model/set_env.sh
export PYTHONPATH=/usr/local/Ascend/llm_model:$PYTHONPATH
cd /usr/local/Ascend/mindie/latest/mindie-service/bin
python convert_weights.py --model_path /home/aicc/model_from_hf/Baichuan2-13B-Chat
```
In config.json, change bfloat16 to float16.
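The dtype change can be scripted with `sed`. This sketch assumes the value appears as the literal string `"bfloat16"` in the weight directory's `config.json` (in Hugging Face checkpoints this is normally the `torch_dtype` field); the function name `use_float16` is hypothetical, and a `.bak` backup of the original file is kept:

```bash
#!/bin/bash
# use_float16 CONFIG_JSON: replace the "bfloat16" value with "float16" in place,
# leaving the original file next to it as CONFIG_JSON.bak.
use_float16() {
    local cfg="$1"
    sed -i.bak 's/"bfloat16"/"float16"/g' "$cfg"
}

# Example (weight path taken from the convert_weights.py command above):
# use_float16 /home/aicc/model_from_hf/Baichuan2-13B-Chat/config.json
```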
```
docker run -it --rm \
-e ASCEND_VISIBLE_DEVICES=0,1,2,3 \
--net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/baichuan2-13b.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
```
docker run -it --rm \
-e ASCEND_VISIBLE_DEVICES=0,1,2,3 \
--net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/baichuan2-13b-8tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
# kvcache: 18
```
```
docker run -it --rm \
-e ASCEND_VISIBLE_DEVICES=0,1 \
--net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/baichuan2-13b-2tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
nohup python performance-stream-baichuan2.py > baichuan2-2tp.log 2>&1 &
```
### chatglm3
```
docker run -it --rm \
--net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/baichuan2-13b-2tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
```
docker run -it --rm \
--net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/chatglm3-6b-4tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
```
docker run -it --rm \
--net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /home/aicc/docker/chatglm3-6b-1tp.json:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json \
ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1
```
================================================
FILE: llm-localization/ascend/mindie/docker/TEST.md
================================================
```
# dockerfile
docker build --network=host -f mindie-env-1.0.Dockerfile -t ascendhub.huawei.com/public-ascendhub/mindie-env:1.0.RC1-800I-A2-aarch64 .
docker run -it --rm --net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--entrypoint=bash \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
ascendhub.huawei.com/public-ascendhub/mindie-env:1.0.RC1-800I-A2-aarch64
docker build --network=host -f mindie-all-1.0.Dockerfile -t ascendhub.huawei.com/public-ascendhub/mindie-all:1.0.RC1-800I-A2-aarch64 .
docker run -it --rm --net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--entrypoint=bash \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
ascendhub.huawei.com/public-ascendhub/mindie-all:1.0.RC1-800I-A2-aarch64
```
================================================
FILE: llm-localization/ascend/mindie/docker/baichuan2-13b.json
================================================
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 4
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "0.0.0.0",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[0,1,2,3]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "baichuan2-13b",
"modelWeightPath" : "/home/aicc/model_from_hf/Baichuan2-13B-Chat",
"worldSize" : 4,
"cpuMemSize" : 5,
"npuMemSize" : 16,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 200,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
================================================
FILE: llm-localization/ascend/mindie/docker/baichuan2-7b.json
================================================
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 4
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "127.0.0.1",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[0,1,2,3]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "baichuan2-7b",
"modelWeightPath" : "/home/aicc/model_from_hf/Baichuan2-7B-Chat",
"worldSize" : 4,
"cpuMemSize" : 5,
"npuMemSize" : 16,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 200,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
================================================
FILE: llm-localization/ascend/mindie/docker/deploy.sh
================================================
#!/bin/bash
echo "Input arguments:" "$@"
for a in "$@"; do
# Strip everything up to the first "=" so that values may themselves contain "=".
case "$a" in
--model_name=*) model_name="${a#*=}" ;;
--model_weight_path=*) model_weight_path="${a#*=}" ;;
--world_size=*) world_size="${a#*=}" ;;
--npu_mem_size=*) npu_mem_size="${a#*=}" ;;
esac
done
if [ -z "$model_name" ]; then
model_name="default"
fi
if [ -z "$model_weight_path" ]; then
model_weight_path="/workspace/models"
fi
if [ -z "$world_size" ]; then
world_size=4
fi
if [ -z "$npu_mem_size" ]; then
npu_mem_size=8
fi
echo "Platform arguments: model_name: $model_name, model_weight_path: $model_weight_path, world_size: $world_size, npu_mem_size: $npu_mem_size"
npuids=""
card_num=$(($world_size - 1))
for i in `seq 0 $card_num`
do
if [[ $i == $card_num ]] ;
then
npuids=$npuids$i
else
npuids=$npuids$i","
fi
done
echo $npuids
DEPLOYMENT_CONF_PATH="/home/guodong.li/workspace/config.json"
# DEPLOYMENT_CONF_PATH="/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json"
cat <<EOF > "$DEPLOYMENT_CONF_PATH"
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 4
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "0.0.0.0",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[$npuids]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "$model_name",
"modelWeightPath" : "$model_weight_path",
"worldSize" : $world_size,
"cpuMemSize" : 5,
"npuMemSize" : $npu_mem_size,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 200,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
EOF
echo "Deployment config: $DEPLOYMENT_CONF_PATH"
cat $DEPLOYMENT_CONF_PATH
# source /usr/local/Ascend/ascend-toolkit/set_env.sh
# source /usr/local/Ascend/mindie/set_env.sh
# source /usr/local/Ascend/llm_model/set_env.sh
# export PYTHONPATH=/usr/local/Ascend/llm_model:$PYTHONPATH
# cd /usr/local/Ascend/mindie/latest/mindie-service/bin
# ./mindieservice_daemon
================================================
FILE: llm-localization/ascend/mindie/docker/install_and_enable_cann.sh
================================================
#!/bin/bash
# Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Install Torch, Torch_npu, Apex
pip3 install torch-2.1.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
PYTORCH_MANYLINUX=pytorch_v2.1.0-6.0.rc1_py310.tar.gz
TORCH_NPU_IN_PYTORCH_MANYLINUX=torch_npu-2.1.0.post3_20240413-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
APEX_IN_PYTORCH_MANYLINUX=apex-0.1_ascend_20240413-cp310-cp310-linux_aarch64.whl
mkdir torch
cp ${PYTORCH_MANYLINUX} torch \
&& cd torch \
&& tar -xzvf ${PYTORCH_MANYLINUX} \
&& cd ..
echo "start install pytorch, wait for a minute..."
pip3 install torch/${TORCH_NPU_IN_PYTORCH_MANYLINUX} --quiet 2> /dev/null
if [ $? -eq 0 ]; then
echo "pip3 install torchnpu successfully"
else
echo "pip3 install torchnpu failed"
fi
pip3 install torch/${APEX_IN_PYTORCH_MANYLINUX} --quiet 2> /dev/null
if [ $? -eq 0 ]; then
echo "pip3 install apex successfully"
else
echo "pip3 install apex failed"
fi
rm -rf torch
# Install Ascend Cann Library
CANN_TOOKIT="Ascend-cann-toolkit_8.0.RC1_linux-aarch64.run"
CANN_KERNELS="Ascend-cann-kernels-910b_8.0.RC1_linux.run"
chmod +x *.run
yes | ./${CANN_TOOKIT} --install --quiet
toolkit_status=$?
if [ ${toolkit_status} -eq 0 ]; then
echo "install toolkit successfully"
else
echo "install toolkit failed with status ${toolkit_status}"
fi
yes | ./${CANN_KERNELS} --install --quiet
kernels_status=$?
if [ ${kernels_status} -eq 0 ]; then
echo "install kernels successfully"
else
echo "install kernels failed with status ${kernels_status}"
fi
# source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Install Atb and Model
if [ ! -d "/home/llm_model" ]; then
rm -rf /home/llm_model
fi
mkdir -p /usr/local/Ascend/llm_model
MINDIE="Ascend-mindie_*_linux-aarch64.run"
MODEL="Ascend-mindie-atb-models_1.0.RC1_linux-aarch64_torch2.1.0-abi0.tar.gz"
tar -xzf ./${MODEL} -C /usr/local/Ascend/llm_model
yes | ./${MINDIE} --install --quiet 2> /dev/null
atb_status=$?
if [ ${atb_status} -eq 0 ]; then
echo "install atb successfully"
else
echo "install atb failed with status ${atb_status}"
fi
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/llm_model/set_env.sh
================================================
FILE: llm-localization/ascend/mindie/docker/llm-server.sh
================================================
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/llm_model/set_env.sh
export PYTHONPATH=/usr/local/Ascend/llm_model:$PYTHONPATH
cd /usr/local/Ascend/mindie/latest/mindie-service/bin
./mindieservice_daemon
================================================
FILE: llm-localization/ascend/mindie/docker/mindie-1.0.Dockerfile
================================================
#FROM ascendhub.huawei.com/public-ascendhub/mindie-service-env:1.0.RC1-800I-A2-aarch64
FROM ascendhub.huawei.com/public-ascendhub/mindie-service-env:v2
ENV APP_DIR=/workspace
RUN mkdir -p $APP_DIR
# COPY qwen1.5-14b.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
COPY baichuan2-7b.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
COPY llm-server.sh $APP_DIR
RUN chmod -R 777 $APP_DIR/llm-server.sh
ENTRYPOINT $APP_DIR/llm-server.sh
# docker build --network=host -f mindie-1.0.Dockerfile -t ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.0 .
# docker build --network=host -f mindie-1.0.Dockerfile -t ascendhub.huawei.com/public-ascendhub/mindie-service-online:v1.1 .
================================================
FILE: llm-localization/ascend/mindie/docker/mindie-all-1.0.Dockerfile
================================================
FROM ascendhub.huawei.com/public-ascendhub/mindie:1.0.RC1-800I-A2-aarch64
# USER root
COPY driver /usr/local/Ascend/driver
RUN ls -al /usr/local/Ascend/driver
ENV APP_DIR=/workspace
RUN mkdir -p $APP_DIR
COPY install_and_enable_cann.sh /opt/package/install_and_enable_cann.sh
RUN cd /opt/package && ls -al && cat /opt/package/install_and_enable_cann.sh && source ./install_and_enable_cann.sh
RUN pip install transformers==4.37.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
COPY qwen1.5-14b.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
COPY llm-server.sh $APP_DIR
RUN chmod -R 777 $APP_DIR/llm-server.sh
# Exec-form ENTRYPOINT does not expand $APP_DIR, so the path is written out in full.
ENTRYPOINT ["/workspace/llm-server.sh"]
================================================
FILE: llm-localization/ascend/mindie/docker/mindie-env-1.0.Dockerfile
================================================
FROM ascendhub.huawei.com/public-ascendhub/mindie:1.0.RC1-800I-A2-aarch64
USER root
ENV APP_DIR=/workspace
RUN mkdir -p $APP_DIR
RUN cd /opt/package && ls -al && source ./install_and_enable_cann.sh
RUN pip install transformers==4.37.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
================================================
FILE: llm-localization/ascend/mindie/docker/qwen-72b.json
================================================
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 4
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "127.0.0.1",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[0,1,2,3,4,5,6,7]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "qwen-72b",
"modelWeightPath" : "/home/aicc/model_from_hf/qwen-72b-chat-hf",
"worldSize" : 8,
"cpuMemSize" : 5,
"npuMemSize" : 8,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 200,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
================================================
FILE: llm-localization/ascend/mindie/docker/qwen1.5-14b.json
================================================
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 4
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "127.0.0.1",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[0,1,2,3]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "qwen1.5-14b",
"modelWeightPath" : "/home/aicc/model_from_hf/Qwen1.5-14B-Chat",
"worldSize" : 4,
"cpuMemSize" : 5,
"npuMemSize" : 12,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 200,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
================================================
FILE: llm-localization/ascend/mindie/docker/qwen1.5-72b.json
================================================
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 4
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "127.0.0.1",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[0,1,2,3,4,5,6,7]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "qwen1.5-72b",
"modelWeightPath" : "/home/aicc/model_from_hf/Qwen1.5-72B",
"worldSize" : 8,
"cpuMemSize" : 5,
"npuMemSize" : 8,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 200,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
================================================
FILE: llm-localization/ascend/mindie/docker/qwen1.5-7b.json
================================================
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 4
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "0.0.0.0",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[0,1]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "qwen1.5-7b",
"modelWeightPath" : "/home/aicc/model_from_hf/Qwen1.5-7B-Chat",
"worldSize" : 2,
"cpuMemSize" : 5,
"npuMemSize" : 16,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 200,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
================================================
FILE: llm-localization/ascend/mindie/llm-server.sh
================================================
#!/bin/bash
echo "Input arguments:" $@
for a in "$@"; do
#echo $a
if [[ `echo $a | grep "^--model_name="` ]]; then
model_name=`echo $a | grep "^--model_name=" | awk -F '=' '{print $2}'`
fi
if [[ `echo $a | grep "^--model_weight_path="` ]]; then
model_weight_path=`echo $a | grep "^--model_weight_path=" | awk -F '=' '{print $2}'`
fi
if [[ `echo $a | grep "^--world_size="` ]]; then
world_size=`echo $a | grep "^--world_size=" | awk -F '=' '{print $2}'`
fi
if [[ `echo $a | grep "^--npu_mem_size="` ]]; then
npu_mem_size=`echo $a | grep "^--npu_mem_size=" | awk -F '=' '{print $2}'`
fi
done
if [ -z "$model_name" ]; then
model_name="default"
fi
if [ -z "$model_weight_path" ]; then
model_weight_path="/workspace/model"
fi
if [ -z "$world_size" ]; then
world_size=4
fi
if [ -z "$npu_mem_size" ]; then
npu_mem_size=8
fi
echo "Platform arguments: model_name: $model_name, model_weight_path: $model_weight_path , world_size: $world_size , npu_mem_size: $npu_mem_size"
npuids=""
card_num=$(($world_size - 1))
for i in `seq 0 $card_num`
do
if [[ $i == $card_num ]] ;
then
npuids=$npuids$i
else
npuids=$npuids$i","
fi
done
echo $npuids
# DEPLOYMENT_CONF_PATH="/home/guodong.li/workspace/config.json"
DEPLOYMENT_CONF_PATH="/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json"
cat <<EOF > $DEPLOYMENT_CONF_PATH
{
"OtherParam":
{
"ResourceParam" :
{
"cacheBlockSize" : 128,
"preAllocBlocks" : 8
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "/logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "0.0.0.0",
"port" : 1025,
"maxLinkNum" : 300,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"tlsCrl" : "security/certs/server_crl.pem"
}
},
"WorkFlowParam":
{
"TemplateParam" :
{
"templateType": "Standard",
"templateName" : "Standard_llama",
"pipelineNumber" : 1
}
},
"ModelDeployParam":
{
"maxSeqLen" : 2560,
"npuDeviceIds" : [[$npuids]],
"ModelParam" : [
{
"modelInstanceType": "Standard",
"modelName" : "$model_name",
"modelWeightPath" : "$model_weight_path",
"worldSize" : $world_size,
"cpuMemSize" : 5,
"npuMemSize" : $npu_mem_size,
"backendType": "atb"
}
]
},
"ScheduleParam":
{
"maxPrefillBatchSize" : 256,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 256,
"maxIterTimes" : 1024,
"maxPreemptCount" : 200,
"supportSelectBatch" : true,
"maxQueueDelayMicroseconds" : 50000
}
}
EOF
echo "Deployment config: $DEPLOYMENT_CONF_PATH"
cat $DEPLOYMENT_CONF_PATH
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/llm_model/set_env.sh
export PYTHONPATH=/usr/local/Ascend/llm_model:$PYTHONPATH
cd /usr/local/Ascend/mindie/latest/mindie-service/bin
./mindieservice_daemon
================================================
FILE: llm-localization/ascend/mindie/mindid-1.0-offical.md
================================================
# README
- https://www.hiascend.com/document/detail/zh/mindie/10RC1/description/whatismindie/mindie_what_0000.html
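- This README describes the scripts shared by all models and how to use them.
## Path variables
| Variable | Meaning |
|--------|--------------------------------------------------|
| working_dir | Directory where the acceleration library and the model repository are placed after download |
| llm_path | Path of the model repository. With the prebuilt package it is `${working_dir}/MindIE-LLM/`; with code downloaded from gitee it is `${working_dir}/MindIE-LLM/examples/atb_models` |
| weight_path | Path of the model weights |
| w8a8s_weight_path | Path of the sparse-quantized weights |
| w8a8sc_weight_path | Path of the sparse-quantized weights after sharding and compression |
| cur_dir | Directory from which commands or scripts are run (the current directory) |
## Weights
### Weight settings
- `${weight_path}/config.json` must set `torch_dtype` and `quantize` to identify the precision and quantization type of the weights
- If the `torch_dtype` and `quantize` fields do not exist, add them
- Configuration
| Quantization type and precision | torch_dtype | quantize |
|----------------|-------------|----------|
| FP16 | "float16" | not set |
| BF16 | "bfloat16" | not set |
| W8A8 | "float16" | "w8a8" |
| W8A8S | "float16" | "w8a8s" |
| W8A8SC | "float16" | "w8a8sc" |
| W8A16 | "float16" | "w8a16" |
- Examples
- LLaMa weights in BF16 precision, not quantized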
```json
{
"architectures": [
"LlamaForCausalLM"
],
...
"torch_dtype": "bfloat16",
...
}
```
- LLaMa weights in FP16 precision, W8A16 quantized
```json
{
"architectures": [
"LlamaForCausalLM"
],
...
"torch_dtype": "float16",
...
"quantize": "w8a16"
}
```
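The settings in the table above can also be applied with a short script. The following is a minimal sketch, assuming `config.json` is plain JSON and that unquantized weights simply omit the `quantize` field; the helper name `set_weight_format` is hypothetical and not part of MindIE.

```python
import json
from pathlib import Path

# Allowed combinations, copied from the table above (None means the field is absent).
WEIGHT_FORMATS = {
    "FP16": ("float16", None),
    "BF16": ("bfloat16", None),
    "W8A8": ("float16", "w8a8"),
    "W8A8S": ("float16", "w8a8s"),
    "W8A8SC": ("float16", "w8a8sc"),
    "W8A16": ("float16", "w8a16"),
}


def set_weight_format(config_path, fmt):
    """Hypothetical helper: write torch_dtype/quantize for the given format."""
    torch_dtype, quantize = WEIGHT_FORMATS[fmt]
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["torch_dtype"] = torch_dtype
    if quantize is None:
        # Assumption: unquantized weights carry no quantize field at all.
        config.pop("quantize", None)
    else:
        config["quantize"] = quantize
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8")
    return config
```

For example, `set_weight_format("/path/to/weights/config.json", "W8A16")` would set `torch_dtype` to `"float16"` and `quantize` to `"w8a16"` while leaving every other field untouched.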
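### Weight conversion
> Only weight files in safetensor format can currently be loaded. If the environment already has weights in bin format, convert them as described below
> If the environment has no model weights yet, download them from the Hugging Face website
- Use `${llm_path}/examples/convert/convert_weights.py` to convert bin weights to safetensor format
- Example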
```shell
cd ${llm_path}
python examples/convert/convert_weights.py --model_path ${weight_path}
```
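- Note: run the command above from inside `${llm_path}`; the script uses relative paths, so running it from elsewhere fails with a module not found error
- The output is saved in the same directory as the bin weights
### Generating sparse-quantized weights
- Step 1: generate the sparse-quantized weights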
```shell
cd ${llm_path}
python -m examples.convert.model_slim.sparse_quantifier --model_path ${weight_path} --save_directory ${w8a8s_weight_path}
```
- Step 2: shard and compress the quantized weights
```shell
torchrun --nproc_per_node {TP} -m examples.convert.model_slim.sparse_compressor --model_path ${w8a8s_weight_path} --save_directory ${w8a8sc_weight_path}
```
- `{TP}` is the tensor-parallel degree
- Note: weights sharded with TP=4 must also be run with TP=4
- Example
```shell
torchrun --nproc_per_node 2 -m examples.convert.model_slim.sparse_compressor --model_path /data1/weights/model_slim/llama2-7b_w8a8s --save_directory /data1/weights/model_slim/llama2-7b_w8a8sc_temp
```
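## Launch scripts
- The Flash Attention launch script is `${llm_path}/examples/run_fa.py`
- The Paged Attention launch script is `${llm_path}/examples/run_pa.py`
### Environment variables for the launch scripts
- `USE_ASCEND`
  - Whether to use the Ascend acceleration library
  - Set to 1 to use it, 0 to disable it; enabled by default
- `MAX_MEMORY_GB`
  - Upper limit on device memory
  - By default, 3 GB is reserved out of the server's maximum device memory (in GB)
  - If errors caused by insufficient device memory occur, lower this value
- `ASCEND_RT_VISIBLE_DEVICES`
  - Logical NPU cores available on the current machine; separate multiple cores with commas
  - Look up the core IDs with the `npu-smi info` command
    - On Atlas 800I A2 servers, read them from the NPU column of the output
    - On Atlas 300I DUO servers, read them from the Device column of the output
  - To use one card with two chips, specify at least two visible cores; to use two cards with four chips, specify at least four visible cores
- `BIND_CPU`
  - Switch for binding CPU cores
  - Set to 1 to bind cores, 0 to skip binding; binding is enabled by default
  - If NUMA is not configured on the machine or binding fails, set BIND_CPU to 0
- `ATB_PROFILING_ENABLE`
  - Whether to write performance profiling files
  - Set to 1 to generate profiling files, 0 to skip them; off by default
- `PROFILING_FILEPATH`
  - Path of the profiling files (if they are generated)
  - Defaults to `${cur_dir}/profiling`
- `ATB_LLM_BENCHMARK_ENABLE`
  - Whether to collect end-to-end and per-token performance data
  - Set to 1 to record timings, 0 to skip them; off by default
- `ATB_LLM_BENCHMARK_FILEPATH`
  - Where the performance data is saved
  - Defaults to `${cur_dir}/benchmark_result/benchmark.csv`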
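When the launch scripts are started from another program, the variables listed above can be collected into one environment mapping. This is a minimal sketch under that assumption; `build_launch_env` and its parameters are hypothetical names, and only variables documented above are set.

```python
import os


def build_launch_env(visible_devices, max_memory_gb=None, bind_cpu=True,
                     benchmark=False, base_env=None):
    """Hypothetical helper: return an environment dict for run_fa.py / run_pa.py."""
    env = dict(os.environ if base_env is None else base_env)
    # Comma-separated logical NPU core IDs, as reported by `npu-smi info`.
    env["ASCEND_RT_VISIBLE_DEVICES"] = ",".join(str(d) for d in visible_devices)
    # 1 binds CPU cores, 0 disables binding (useful when NUMA is not configured).
    env["BIND_CPU"] = "1" if bind_cpu else "0"
    # 1 records end-to-end and per-token timings.
    env["ATB_LLM_BENCHMARK_ENABLE"] = "1" if benchmark else "0"
    if max_memory_gb is not None:
        env["MAX_MEMORY_GB"] = str(max_memory_gb)
    return env
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` when invoking `torchrun`.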
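### run_fa.py script parameters
- `--model_path`
  - Path of the model weights
- `--input_text`
  - Input question
  - Accepts a string or a list of strings
  - If the value is a string, it is replicated according to the batch size argument when the inference input is built
  - If the value is a list, the batch size argument is ignored and the actual batch size is the length of the list
- `--max_input_length`
  - Maximum input length
  - Defaults to 512 tokens
  - Inputs shorter than 512 tokens are padded automatically
- `--max_output_length`
  - Maximum output length
  - Defaults to 20 tokens
- `--batch_size`
  - Fixed batch size used for inference
  - Defaults to a batch size of 1
- `--is_flash_causal_lm`
  - Whether to use Paged Attention; off by default
- `--use_refactor`
  - If True, the unified (refactored) code is used; if False, the non-unified code is used; use_refactor is on by default
- Example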
```shell
# Run Flash Attention on multiple cards, set the model weight path, set the output length to 2048 tokens, and use BF16 precision
torchrun --nproc_per_node 2 --master_port 20038 -m examples.run_fa --model_path ${weight_path} --max_output_length 2048 --is_bf16
```
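### run_pa.py script parameters
- `--model_path`
  - Path of the model weights
- `--input_text`
  - Input question
  - Accepts a string or a list of strings
  - If the value is a string or a single-element list, it is replicated according to the batch size argument when the inference input is built
  - If the value is a multi-element list, the batch size argument is ignored and the actual batch size is the length of the list
- `--max_position_embeddings`
  - Maximum input length the model accepts
  - Read from the config file of the model weights by default
- `--max_output_length`
  - Maximum output length
  - Defaults to 20 tokens
- `--max_prefill_tokens`
  - Maximum input length during the prefill stage
  - Defaults to 4096 tokens
- `--max_batch_size`
  - Maximum batch size; the batch size actually used varies at run time and may not reach this maximum
  - Defaults to a batch size of 1
- `--is_flash_model`
  - Whether to use Paged Attention; on by default
- `--use_refactor`
  - If True, the unified (refactored) code is used; if False, the non-unified code is used; use_refactor is on by default
- Example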
```shell
# Run Paged Attention on multiple cards, set the model weight path, set the output length to 2048 tokens, and use the unified code
torchrun --nproc_per_node 2 --master_port 20038 -m examples.run_pa --model_path ${weight_path} --max_output_length 2048 --use_refactor True
```
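The interaction between `--input_text` and the batch size argument differs slightly between the two scripts. The sketch below models only the rule as documented in this README, not the scripts' actual code; `resolve_batch` and its `script` parameter are hypothetical names.

```python
def resolve_batch(input_text, batch_size, script="pa"):
    """Return the list of prompts that would form the batch.

    script="fa": a string is replicated batch_size times; any list is used as-is.
    script="pa": a string or a single-element list is replicated batch_size
    times; a list with several elements is used as-is.
    """
    if isinstance(input_text, str):
        return [input_text] * batch_size
    prompts = list(input_text)
    if script == "pa" and len(prompts) == 1:
        return prompts * batch_size
    # The batch size argument is ignored: the list length is the real batch size.
    return prompts
```

With this rule, `resolve_batch(["a", "b", "c"], batch_size=8)` yields three prompts regardless of the requested batch size, while `resolve_batch("a", batch_size=8)` yields eight copies.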
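### Special scenarios
On 300I DUO and 800I A2, communication operators exchange data through shared memory. In a single-machine, multi-user setup, each user therefore has to set the following environment variable so that the shared-memory regions stay separate:
`export ATB_SHARE_MEMORY_NAME_SUFFIX="user1"`
Single-machine, multi-user example: a 300I DUO has 4 cards and each card runs its own model inference task. Set the variable above to a different value per task, such as `user1` and `user2`.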
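A minimal sketch of giving each concurrent task its own `ATB_SHARE_MEMORY_NAME_SUFFIX` value follows. The helper `per_task_envs` and the `user<N>` naming pattern are illustrative assumptions; any values work as long as they differ between tasks.

```python
def per_task_envs(num_tasks, base_env=None, prefix="user"):
    """Hypothetical helper: one environment dict per task, each with a unique suffix."""
    envs = []
    for i in range(num_tasks):
        env = dict(base_env or {})
        # A distinct suffix keeps each task's shared-memory region separate.
        env["ATB_SHARE_MEMORY_NAME_SUFFIX"] = f"{prefix}{i + 1}"
        envs.append(env)
    return envs
```

Each returned mapping can then be used as the environment of one inference process.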
================================================
FILE: llm-localization/ascend/mindie/mindid-performance.md
================================================
```
curl --location --request POST 'http://192.168.16.211:1025/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data-raw '{
"model": "qwen1.5-14b",
"messages": [
{
"role": "user",
"content": "<|im_start|>user\n你好,我的名字是李聪明。请问你是谁?<|im_end|>\n<|im_start|>assistant\n你好,李聪明!很高兴认识你。我是一个大型语言模型,你可以叫我通义千问。有什么问题或需要帮助的话,请随时告诉我。<|im_end|>\n<|im_start|>user\n我最近心情很糟糕,能给我一些建议吗?<|im_end|><|im_start|>assistant\n我很理解你现在的感受。面对糟糕的心情,以下是一些可能有帮助的建议:\n\n1. **与他人分享**:告诉信任的朋友或家人你的感受,他们可能能提供安慰和支持。\n2. **自我关怀**:确保每天有足够的休息,做些你喜欢的事情,比如阅读、听音乐或运动。\n3. **运动与放松**:适度的运动可以帮助释放压力,尝试瑜伽、冥想或深呼吸练习。\n4. **寻求专业帮助**:如果你觉得压力过大,考虑咨询心理医生或心理咨询师。\n5. **保持积极思考**:试着找出生活中的小确幸,每天对自己说一些积极的话。\n6. **时间管理**:合理安排时间,避免过度压力,留出放松的时间。\n\n记住,处理情绪需要时间和耐心,不要对自己太苛刻。如果你的情绪持续低落,可能需要更专业的支持。希望这些建议对你有所帮助。<|im_end|>\n<|im_start|>user\n请问我叫什么名字?<|im_end|>\n<|im_start|>assistant\n"
}
],
"max_tokens":256,
"top_p": 0.85,
"n": 10,
"logprobs": true,
"stop": "<|im_end|>",
"stream": true
}'
```
================================================
FILE: llm-localization/ascend/mindie/mindie-1.0.Dockerfile
================================================
FROM ascendhub.huawei.com/public-ascendhub/mindie:1.0.RC1-800I-A2-aarch64
RUN cd /opt/package && source install_and_enable_cann.sh \
&& source /usr/local/Ascend/ascend-toolkit/set_env.sh \
&& source /usr/local/Ascend/mindie/set_env.sh \
&& source /usr/local/Ascend/llm_model/set_env.sh
================================================
FILE: llm-localization/ascend/mindie/mindie-1.0.RC2.md
================================================
Documentation:
- https://www.hiascend.com/document/detail/zh/mindie/10RC2/whatismindie/mindie_what_0001.html
docker:
- https://www.hiascend.com/developer/ascendhub/detail/af85b724a7e5469ebd7ea13c3439d48f
rsync -P --rsh=ssh -r root@192.168.16.xxx:/root/mindie-1.0.rc2.tar .
swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:1.0.RC2-800I-A2-aarch64
```
docker run -it -d --name mindie-rc2-45 --net=host \
-e ASCEND_VISIBLE_DEVICES=4,5 \
-p 1925:1025 \
--shm-size=32g \
-w /workspace \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /data/model_from_hf:/workspace/model \
swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:1.0.RC2-800I-A2-aarch64 \
/bin/bash
docker exec -it mindie-rc2-45 bash
cd /opt/package
# Install the CANN package
source ./install_and_enable_cann.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/llm_model/set_env.sh
vim /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
# Model weight path to set in config.json: /workspace/model/Qwen1.5-7B-Chat/
export MIES_PYTHON_LOG_TO_FILE=1
export MIES_PYTHON_LOG_TO_STDOUT=1
export PYTHONPATH=/usr/local/Ascend/llm_model:$PYTHONPATH
cd /usr/local/Ascend/mindie/latest/mindie-service/bin
./mindieservice_daemon
```
## New image
```
docker commit -a "guodong" -m "mindie-1.0.RC2" 365815a95f16 harbor/ascend/mindie-base:1.0.RC2
docker save -o mindie-base.tar harbor/ascend/mindie-base:1.0.RC2
rsync -P --rsh=ssh -r root@192.168.16.211:/home/workspace/mindie-base.tar .
# -p 192.168.16.xx:1025:1025
docker run -it --rm \
-e ASCEND_VISIBLE_DEVICES=2,3 \
-p 1025:1025 \
--shm-size=32g \
-w /workspace \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /data/model_from_hf:/workspace/model \
harbor/ascend/mindie-base:1.0.RC2 \
/bin/bash
```
```
llm-server3.sh
docker run -it --rm \
-e ASCEND_VISIBLE_DEVICES=6,7 \
-p 1825:1025 \
--env AIE_LLM_CONTINUOUS_BATCHING=1 \
--shm-size=32g \
-w /workspace \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /data/model_from_hf/Qwen1.5-7B-Chat:/workspace/model \
-v /home/workspace/llm-server3.sh:/workspace/llm-server.sh \
-v /home/workspace/mindservice.log:/usr/local/Ascend/mindie/latest/mindie-service/logs/mindservice.log \
harbor/ascend/mindie-base:1.0.RC2 \
/bin/bash
docker run -it --rm \
-e ASCEND_VISIBLE_DEVICES=6,7 \
-p 1525:1025 \
--env AIE_LLM_CONTINUOUS_BATCHING=1 \
--shm-size=32g \
-w /workspace \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /data/model_from_hf/Qwen1.5-7B-Chat:/workspace/model \
-v /home/workspace/llm-server3.sh:/workspace/llm-server.sh \
-v /home/workspace/mindservice.log:/usr/local/Ascend/mindie/latest/mindie-service/logs/mindservice.log \
harbor/ascend/mindie-base:1.0.RC2 \
/workspace/llm-server.sh \
--model_name=qwen-chat \
--model_weight_path=/workspace/model \
--world_size=2 \
--npu_mem_size=15
docker run -it --rm \
-e ASCEND_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-p 1525:1025 \
--env AIE_LLM_CONTINUOUS_BATCHING=1 \
--shm-size=32g \
-w /workspace \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /data/model_from_hf/Qwen2-72B-Instruct:/workspace/model \
-v /home/workspace/llm-server3.sh:/workspace/llm-server.sh \
-v /home/workspace/mindservice.log:/usr/local/Ascend/mindie/latest/mindie-service/logs/mindservice.log \
harbor/ascend/mindie-base:1.0.RC2 \
/workspace/llm-server.sh \
--model_name=qwen-chat \
--model_weight_path=/workspace/model \
--world_size=8 \
--npu_mem_size=8
```
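The `--world_size` and `--npu_mem_size` arguments passed in the commands above end up in the `ModelDeployParam` section of `config.json`, with device IDs numbered from 0 to `world_size - 1` as the launcher script in this repository does. The following sketch shows that mapping only; `model_deploy_param` is a hypothetical helper, and the fixed values (`maxSeqLen`, `cpuMemSize`, `backendType`) are copied from the configuration files in this repository.

```python
import json


def model_deploy_param(model_name, model_weight_path, world_size, npu_mem_size):
    """Hypothetical helper: build the part of ModelDeployParam the arguments affect."""
    device_ids = list(range(world_size))  # 0 .. world_size - 1
    return {
        "maxSeqLen": 2560,
        "npuDeviceIds": [device_ids],
        "ModelParam": [
            {
                "modelName": model_name,
                "modelWeightPath": model_weight_path,
                "worldSize": world_size,
                "cpuMemSize": 5,
                "npuMemSize": npu_mem_size,
                "backendType": "atb",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(model_deploy_param("qwen-chat", "/workspace/model", 2, 15), indent=2))
```

Keeping `worldSize` equal to the number of entries in `npuDeviceIds` is the invariant the launcher script relies on.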
================================================
FILE: llm-localization/ascend/mindie/mindie-1.0.md
================================================
- https://ascendhub.huawei.com/#/detail/mindie
- ascendhub.huawei.com/public-ascendhub/mindie:1.0.RC1-800I-A2-aarch64
Shell script that enables the CANN software stack in one step (install_and_enable_cann.sh)
- /usr/local/Ascend/llm_model/pytorch/examples/chatglm2/6b/README.md
Storing the weights under /home/chatglm2_6b/weight and setting CHECKPOINT=/home/chatglm2_6b/weight is recommended.
## Write the docker launch script
Write the docker launch script start-docker.sh as shown below and store it in /home/chatglm2_6b.
```
IMAGES_ID=$1
NAME=$2
if [ $# -ne 2 ]; then
echo "error: need two arguments: the image ID and the container name."
exit 1
fi
docker run --name ${NAME} -it -d --net=host --shm-size=500g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--entrypoint=bash \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
${IMAGES_ID}
```
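Parameters:
- IMAGES_ID is the image ID (the IMAGE ID column in the output of `docker images`).
- NAME is the name of the container to start; choose any name.
## Start and enter the container
Run the following commands in order to start and enter the container: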
```
cd /home/chatglm2_6b
# Set this to the IMAGE ID shown by `docker images`
image_id=001b7368f6e0
# Container name, user-defined
custom_image_name=chatGLM2_6B
# Start the container (make sure the host can reach the external network first)
bash start-docker.sh ${image_id} ${custom_image_name}
# Enter the container
docker exec -itu root ${custom_image_name} bash
```
## Enable the Ascend CANN software stack
```
cd /opt/package
# Install the CANN package
source install_and_enable_cann.sh
# After leaving and re-entering the container, reload the CANN environment variables with the following three commands
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/llm_model/set_env.sh
```
## Run inference with the Chatglm2_6b model
```
cd /usr/local/Ascend/llm_model
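# Convert the weights to safetensor
python examples/convert/convert_weights.py --model_path ${CHECKPOINT}
# Run the inference script
python examples/run_pa.py --model_path ${CHECKPOINT}
# Inference runs after startup and prints the default Question and its Answer. To supply your own question, use the --input_texts parameter, for example: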
python examples/run_pa.py --model_path ${CHECKPOINT} --input_texts "What is deep learning?"
```
## Qwen1.5-14B
```
# docker rm -f mindie-dev
docker run --name mindie-dev2 -it -d --net=host --ipc=host \
--shm-size=50g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--entrypoint=bash \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
ascendhub.huawei.com/public-ascendhub/mindie:1.0.RC1-800I-A2-aarch64
docker exec -itu root mindie-dev2 bash
cd /opt/package
# Install the CANN package
source ./install_and_enable_cann.sh
# After leaving and re-entering the container, reload the CANN environment variables with the following three commands
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/llm_model/set_env.sh
rm /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
vim /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
# Model weight path to set in config.json: /home/aicc/model_from_hf/Qwen1.5-14B-Chat
export PYTHONPATH=/usr/local/Ascend/llm_model:$PYTHONPATH
cd /usr/local/Ascend/mindie/latest/mindie-service/bin
./mindieservice_daemon
```
```
transformers==4.30.2
pip install transformers==4.37.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
"torch_dtype": "bfloat16" changed to "float16"
```
================================================
FILE: llm-localization/ascend/mindie/mindie-1.0.rc2-config.json
================================================
{
"OtherParam" :
{
"ResourceParam" :
{
"cacheBlockSize" : 128
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "172.17.0.2",
"managementIpAddress" : "127.0.0.2",
"port" : 1025,
"managementPort" : 1026,
"maxLinkNum" : 1000,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"tlsCrl" : "security/certs/server_crl.pem",
"managementTlsCaFile" : ["management_ca.pem"],
"managementTlsCert" : "security/certs/management_server.pem",
"managementTlsPk" : "security/keys/management_server.key.pem",
"managementTlsPkPwd" : "security/pass/management_mindie_server_key_pwd.txt",
"managementTlsCrl" : "security/certs/management_server_crl.pem",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"multiNodesInferPort" : 1120,
"interNodeTLSEnabled" : true,
"interNodeTlsCaFile" : "security/ca/ca.pem",
"interNodeTlsCert" : "security/certs/server.pem",
"interNodeTlsPk" : "security/keys/server.key.pem",
"interNodeTlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"interNodeKmcKsfMaster" : "tools/pmt/master/ksfa",
"interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb"
}
},
"WorkFlowParam" :
{
"TemplateParam" :
{
"templateType" : "Standard",
"templateName" : "Standard_llama"
}
},
"ModelDeployParam" :
{
"engineName" : "mindieservice_llm_engine",
"modelInstanceNumber" : 1,
"tokenizerProcessNumber" : 8,
"maxSeqLen" : 2560,
"npuDeviceIds" : [[$npuids]],
"multiNodesInferEnabled" : false,
"ModelParam" : [
{
"modelName" : "$model_name",
"modelWeightPath" : "$model_weight_path",
"worldSize" : $world_size,
"cpuMemSize" : 5,
"npuMemSize" : $npu_mem_size,
"backendType": "atb",
"pluginParams" : ""
}
]
},
"ScheduleParam" :
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 0,
"supportSelectBatch" : true,
"maxQueueDelayMicroseconds" : 5000
}
}
================================================
FILE: llm-localization/ascend/mindie/mindie-1.0.rc2-llm-server.sh
================================================
#!/bin/bash
echo "Input arguments:" $@
for a in "$@"; do
#echo $a
if [[ `echo $a | grep "^--model_name="` ]]; then
model_name=`echo $a | grep "^--model_name=" | awk -F '=' '{print $2}'`
fi
if [[ `echo $a | grep "^--model_weight_path="` ]]; then
model_weight_path=`echo $a | grep "^--model_weight_path=" | awk -F '=' '{print $2}'`
fi
if [[ `echo $a | grep "^--world_size="` ]]; then
world_size=`echo $a | grep "^--world_size=" | awk -F '=' '{print $2}'`
fi
if [[ `echo $a | grep "^--npu_mem_size="` ]]; then
npu_mem_size=`echo $a | grep "^--npu_mem_size=" | awk -F '=' '{print $2}'`
fi
done
if [ -z "$model_name" ]; then
model_name="default"
fi
if [ -z "$model_weight_path" ]; then
model_weight_path="/workspace/model"
fi
if [ -z "$world_size" ]; then
world_size=4
fi
if [ -z "$npu_mem_size" ]; then
npu_mem_size=8
fi
echo "Platform arguments: model_name: $model_name, model_weight_path: $model_weight_path , world_size: $world_size , npu_mem_size: $npu_mem_size"
npuids=""
card_num=$(($world_size - 1))
for i in `seq 0 $card_num`
do
if [[ $i == $card_num ]] ;
then
npuids=$npuids$i
else
npuids=$npuids$i","
fi
done
echo $npuids
ip=`hostname -I`
echo "docker ip: [$ip]"
ip=$(echo "$ip" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
echo "docker handle ip: [$ip]"
# DEPLOYMENT_CONF_PATH="/home/guodong.li/workspace/config.json"
DEPLOYMENT_CONF_PATH="/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json"
cat <<EOF > $DEPLOYMENT_CONF_PATH
{
"OtherParam" :
{
"ResourceParam" :
{
"cacheBlockSize" : 128
},
"LogParam" :
{
"logLevel" : "Info",
"logPath" : "logs/mindservice.log"
},
"ServeParam" :
{
"ipAddress" : "$ip",
"managementIpAddress" : "127.0.0.2",
"port" : 1025,
"managementPort" : 1026,
"maxLinkNum" : 1000,
"httpsEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"tlsCrl" : "security/certs/server_crl.pem",
"managementTlsCaFile" : ["management_ca.pem"],
"managementTlsCert" : "security/certs/management_server.pem",
"managementTlsPk" : "security/keys/management_server.key.pem",
"managementTlsPkPwd" : "security/pass/management_mindie_server_key_pwd.txt",
"managementTlsCrl" : "security/certs/management_server_crl.pem",
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"multiNodesInferPort" : 1120,
"interNodeTLSEnabled" : true,
"interNodeTlsCaFile" : "security/ca/ca.pem",
"interNodeTlsCert" : "security/certs/server.pem",
"interNodeTlsPk" : "security/keys/server.key.pem",
"interNodeTlsPkPwd" : "security/pass/mindie_server_key_pwd.txt",
"interNodeKmcKsfMaster" : "tools/pmt/master/ksfa",
"interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb"
}
},
"WorkFlowParam" :
{
"TemplateParam" :
{
"templateType" : "Standard",
"templateName" : "Standard_llama"
}
},
"ModelDeployParam" :
{
"engineName" : "mindieservice_llm_engine",
"modelInstanceNumber" : 1,
"tokenizerProcessNumber" : 8,
"maxSeqLen" : 2560,
"npuDeviceIds" : [[$npuids]],
"multiNodesInferEnabled" : false,
"ModelParam" : [
{
"modelName" : "$model_name",
"modelWeightPath" : "$model_weight_path",
"worldSize" : $world_size,
"cpuMemSize" : 5,
"npuMemSize" : $npu_mem_size,
"backendType": "atb",
"pluginParams" : ""
}
]
},
"ScheduleParam" :
{
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 8192,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 512,
"maxPreemptCount" : 0,
"supportSelectBatch" : true,
"maxQueueDelayMicroseconds" : 5000
}
}
EOF
echo "Deployment config: $DEPLOYMENT_CONF_PATH"
cat $DEPLOYMENT_CONF_PATH
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/llm_model/set_env.sh
export MIES_PYTHON_LOG_TO_FILE=1
export MIES_PYTHON_LOG_TO_STDOUT=1
export PYTHONPATH=/usr/local/Ascend/llm_model:$PYTHONPATH
cd /usr/local/Ascend/mindie/latest/mindie-service/bin
./mindieservice_daemon
================================================
FILE: llm-localization/ascend/mindie/mindie-2.0.rc2.md
================================================
# Quantization
```
cd ${ATB_SPEED_HOME_PATH}
python examples/models/llama3/convert_quant_weights.py --model_path {float_weight_path} --save_directory {w8a8_weight_path} --w_bit 8 --a_bit 8 --disable_level L0 --device_type cpu --anti_method m1 --act_method 1 --calib_file ${llm_path}/examples/convert/model_slim/boolq.jsonl
```
================================================
FILE: llm-localization/ascend/mindie/mindie-20240411.md
================================================
```
docker run -it --name=mindie_server_t37 --net=host --ipc=host --privileged=true \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /home:/workspace \
mindie_server:1.0.T37 \
/bin/bash
docker exec -it mindie_server_t37 bash
docker inspect mindie_server:1.0.T37
docker inspect -f '{{with .State}} {{.Pid}} {{end}}' mindie_server_t37
docker inspect --format='{{.Name}}' mindie_server_t37
docker inspect --format='ARCH: {{.Architecture}} , OS: {{.Os}}' mindie_server:1.0.T37
# ARCH: arm64 , OS: linux
docker inspect -f {{".Architecture"}} mindie_server:1.0.T37
```
{{range .NetworkSettings.Networks}}
```
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/mindie/latest/mindie-service/set_env.sh
source /opt/atb-models/set_env.sh
```
```
cd /usr/local/Ascend/mindie/latest/mindie-service/latest
vim conf/config.json
```
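npuMemSize: upper limit, in GB, on the NPU memory that can be allocated for the KV cache.
Recommended value: 8.
npuMemSize = (total free memory - weight size / TP degree) * coefficient, where the coefficient is 0.8.
Taking llama-65b as an example:
each card has 64 GB of memory in total, with 3 to 4 GB already occupied when idle. The total weight size of llama-65b
is 122 GB. Running on 8 cards, the upper limit for npuMemSize
is (64 - 4 - (122 / 8)) * 0.8 = 35.8.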
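The formula above is easy to evaluate for other models and card counts. This is a small sketch of that arithmetic; the function name and parameter names are illustrative, and all sizes are in GB.

```python
def npu_mem_size_upper_bound(total_gb, idle_used_gb, weight_gb, tp, coefficient=0.8):
    """Upper bound for npuMemSize: (free memory - per-card weight share) * coefficient."""
    if tp <= 0:
        raise ValueError("tp must be a positive number of cards")
    free_gb = total_gb - idle_used_gb   # memory left on an idle card
    weight_per_card_gb = weight_gb / tp  # weights are split across tp cards
    return (free_gb - weight_per_card_gb) * coefficient


if __name__ == "__main__":
    # llama-65b example from the text: 64 GB cards, 4 GB idle usage, 122 GB weights, 8 cards
    print(round(npu_mem_size_upper_bound(64, 4, 122, 8), 1))  # 35.8
```

A negative result means the weights do not fit on the chosen number of cards, so more cards or a smaller model are needed.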
```
/workspace/dataset/qwen
/workspace/aicc/model_from_hf/Baichuan2-7B-Chat
/workspace/aicc/model_from_hf/chatglm3-6b-chat
/workspace/aicc/model_from_hf/chatglm3-6b-chat-full
```
## Mindie_server
```
# Run ./mindieservice_daemon to start the service
cd /usr/local/Ascend/mindie/latest/mindie-service/latest/bin
./mindieservice_daemon
```
## API requests
```
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"inputs": "保持健康的方法",
"parameters": {
"best_of": 1,
"decoder_input_details": true,
"details": true,
"do_sample": true,
"max_new_tokens": 20,
"repetition_penalty": 1.03,
"return_full_text": false,
"seed": null,
"stop": [
"photographer"
],
"top_k": 10,
"temperature": 0.5,
"top_n_tokens": 5,
"top_p": 0.95,
"truncate": null,
"typical_p": 0.95,
"watermark": true
},
"stream": false}' http://127.0.0.1:1025/generate
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"inputs": "保持健康的方法",
"parameters": {
"best_of": 1,
"decoder_input_details": true,
"details": true,
"do_sample": true,
"max_new_tokens": 20,
"repetition_penalty": 1.03,
"return_full_text": false,
"seed": null,
"stop": [
"photographer"
],
"temperature": 0.5,
"top_n_tokens": 5,
"top_p": 0.95,
"truncate": null,
"typical_p": 0.95,
"watermark": true
},
"stream": false}' http://127.0.0.1:1025/generate
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"model": "qwen-72b",
"messages": [
{
"role": "system",
"content": "你是一个有博学多才的助手."
},
{
"role": "user",
"content": "保持健康的方法"
}
]
}' http://127.0.0.1:1025/v1/chat/completions
```
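The same `/generate` call can be prepared from Python with the standard library. This is a minimal sketch that only builds the request; the endpoint, headers, and parameter names are taken from the curl commands above, while `build_generate_request` itself is a hypothetical helper and the response format is not assumed.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=20, base_url="http://127.0.0.1:1025"):
    """Hypothetical helper: build a POST request for the /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "max_new_tokens": max_new_tokens,
            "temperature": 0.5,
            "top_p": 0.95,
        },
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
```

With the service running, the request can be sent with `urllib.request.urlopen(build_generate_request("..."))` and the body read from the returned response object.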
## Performance testing
```
cp token_input_gsm.csv /usr/local/Ascend/mindie/latest/mindie-service/latest/bin
cd /usr/local/Ascend/mindie/latest/mindie-service/latest/bin
./llm_engine_test
```
```
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/mindie/latest/mindie-service/set_env.sh
source /opt/atb-models/set_env.sh
cd /opt/atb-models/tests/modeltest
bash run.sh pa_fp16 performance [[2048,2048]] 1 qwen /workspace/aicc/model_from_hf/qwen-72b-chat-hf 8 > qwen-72b-2048-1-8.log
bash run.sh pa_fp16 performance [[2048,2048],[2048,2048],[2048,2048],[2048,2048],[2048,2048]] 1 qwen /workspace/aicc/model_from_hf/qwen-72b-chat-hf 8 > qwen-72b-2048-1-8.log
bash run.sh pa_fp16 performance [[2048,2048]] 4 qwen /workspace/aicc/model_from_hf/qwen-72b-chat-hf 8 > qwen-72b-2048-4-8.log
```
```
2024-04-23 06:43:44,225 [INFO] [pid: 5844] generate.py-186: Prefill time: 299.3795871734619ms, Decode token time: 37.169588794344854ms, E2E time: 76385.52784919739ms
2024-04-23 06:43:45,396 - [INFO] - model_test.py:419 - batch: 1, seq_len_in: 2048, seq_len_out: 2048, total_time: 76.46122479438782, first_token_time: 299.38, non_first_token_time: 37.17, non_first_token_throughput: 26.90341673392521, e2e_time: 76.46122479438782, e2e_throughput: 26.784818128499577
2024-04-23 06:43:45,397 - [INFO] - model_test.py:434 - batch: 1, non_first_token_throughput_total: 26.90341673392521, non_first_token_throughput_average: 26.90341673392521, e2e_throughput_total: 26.784818128499577, e2e_throughput_average: 26.784818128499577
2024-04-23 06:43:45,399 - [INFO] - model_test.py:464 - qwen_72b batch1 result saved in /opt/atb-models/tests/modeltest/base/../result/qwen_72b/pa_fp16_batch1_performance_test_result.csv
2024-04-23 06:43:45,399 - [INFO] - model_test.py:466 - qwen_72b batch1 formatted result saved in /opt/atb-models/tests/modeltest/base/../result/qwen_72b/pa_fp16_batch1_performance_test_result_formatted.csv
2024-04-23 07:33:44,898 - [INFO] - model_test.py:419 - batch: 4, seq_len_in: 2048, seq_len_out: 2048, total_time: 86.89873886108398, first_token_time: 1157.03, non_first_token_time: 41.85, non_first_token_throughput: 95.57945041816009, e2e_time: 86.89873886108398, e2e_throughput: 94.27064313436932
2024-04-23 07:33:44,898 - [INFO] - model_test.py:434 - batch: 4, non_first_token_throughput_total: 95.57945041816009, non_first_token_throughput_average: 95.57945041816009, e2e_throughput_total: 94.27064313436932, e2e_throughput_average: 94.27064313436932
```
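The throughput figures in these log lines follow from the timings using the two formulas in `run_performance_test`: `batch_size / non_first_token_time` and `batch_size * seq_len_out / e2e_time`. The sketch below recomputes them from the logged values; the helper names are illustrative and are not part of ModelTest.

```python
def non_first_token_throughput(batch_size: int, non_first_token_time_ms: float) -> float:
    """Decode-phase throughput in tokens/s: each decode step yields one token per sequence."""
    return batch_size / (non_first_token_time_ms / 1000)


def e2e_throughput(batch_size: int, seq_len_out: int, e2e_time_s: float) -> float:
    """End-to-end throughput in tokens/s over every generated token."""
    return batch_size * seq_len_out / e2e_time_s


# Timings copied from the batch 1 and batch 4 log lines above.
batch1_decode = non_first_token_throughput(1, 37.17)
batch1_e2e = e2e_throughput(1, 2048, 76.46122479438782)
batch4_decode = non_first_token_throughput(4, 41.85)
batch4_e2e = e2e_throughput(4, 2048, 86.89873886108398)

print(f"batch 1: decode {batch1_decode:.2f} tok/s, e2e {batch1_e2e:.2f} tok/s")
print(f"batch 4: decode {batch4_decode:.2f} tok/s, e2e {batch4_e2e:.2f} tok/s")
```

In these logs, going from batch 1 to batch 4 raises per-token decode latency only from 37.17 ms to 41.85 ms, so decode throughput grows by roughly 3.5x.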
```python
# Nested inside ModelTest.__run_performance: `self`, `performance_prompt`,
# `folder_path` and `csv_results` come from the enclosing scope.
def run_performance_test():
    non_first_token_throughput_total = 0
    e2e_throughput_total = 0
    for seq_len_in, seq_len_out in self.case_pair:
        self.logger.info("batch_size: " + str(self.batch_size) +
                         ", seq_len_in: " + str(seq_len_in) +
                         ", seq_len_out: " + str(seq_len_out))
        if self.model_type == "fa":
            # FA path: random token ids overwrite the tokenized prompt so the input is exactly seq_len_in long.
            input_ids = torch.randint(0, self.model.config.vocab_size, [self.batch_size, seq_len_in],
                                      dtype=torch.int64)
            attention_mask = torch.ones((self.batch_size, seq_len_in), dtype=torch.int64)
            inputs = self.tokenizer(performance_prompt * self.batch_size, return_tensors="pt",
                                    padding='max_length',
                                    max_length=seq_len_in)
            inputs["input_ids"] = input_ids
            inputs["attention_mask"] = attention_mask
            input_ids = inputs.input_ids.to(self.model.device)
            attention_mask = inputs.attention_mask.to(self.model.device)
            with torch.no_grad():
                # Synchronize before and after so the wall-clock time covers all device work.
                getattr(torch, self.core_type).synchronize()
                e2e_start = time.time()
                generate_ids = self.model.generate(inputs=input_ids,
                                                   attention_mask=attention_mask,
                                                   min_new_tokens=seq_len_out,
                                                   max_new_tokens=seq_len_out
                                                   )
                try:
                    _ = self.tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                                    clean_up_tokenization_spaces=False)
                except Exception:
                    _ = [
                        self.tokenizer.decode(output)
                        for output in generate_ids[:, inputs["input_ids"].size(1):].tolist()
                    ]
                getattr(torch, self.core_type).synchronize()
                e2e_end = time.time()
                e2e_time = e2e_end - e2e_start
        else:
            # PA path: build a fresh PARunner for each (seq_len_in, seq_len_out) case.
            input_dict = {
                'rank': self.rank,
                'local_rank': self.local_rank,
                'world_size': self.world_size,
                'max_prefill_tokens': -1,
                'block_size': 128,
                'model_path': self.weight_dir,
                'is_bf16': self.data_type == "bf16",
                'max_position_embeddings': self.max_position_embedding if self.max_position_embedding != -1 else seq_len_in + seq_len_out,
                'max_batch_size': self.batch_size,
                'use_refactor': self.use_refactor,
                'max_input_length': seq_len_in,
                'max_output_length': seq_len_out
            }
            pa_runner = PARunner(**input_dict)
            self.logger.info(str(self.rank) + f'pa_runner: {pa_runner}')
            pa_runner.warm_up()
            input_ids = torch.randint(0, pa_runner.model.config.vocab_size, [seq_len_in],
                                      dtype=torch.int64)
            _, _, e2e_time = pa_runner.infer("", self.batch_size, seq_len_out, True, [input_ids])
            del pa_runner
            torch.npu.empty_cache()
        if self.rank == 0:
            if self.model_type == "fa":
                # Timings saved by the code patched into the transformers generation utils.
                first_token_time_tensor = torch.load(f"{folder_path}/first_token_time.pth").cpu()
                first_token_time = first_token_time_tensor.item()
                non_first_token_time_tensor = torch.load(f"{folder_path}/non_first_token_time.pth").cpu()
                non_first_token_time = non_first_token_time_tensor.item() / (seq_len_out - 1)
            else:
                # Read the timings (in ms) from the first data row of benchmark.csv.
                benchmark_csv = os.path.join(self.script_path, "../benchmark.csv")
                with open(benchmark_csv, newline='') as csvfile:
                    csv_reader = csv.reader(csvfile)
                    next(csv_reader)
                    second_row = next(csv_reader)
                    first_token_time = float(second_row[4]) / 1000
                    non_first_token_time = float(second_row[5]) / 1000
            non_first_token_throughput = self.batch_size / non_first_token_time
            non_first_token_throughput_total += non_first_token_throughput
            e2e_throughput = self.batch_size * seq_len_out / e2e_time
            e2e_throughput_total += e2e_throughput
            self.logger.info(
                f"batch: {self.batch_size}, seq_len_in: {seq_len_in}, seq_len_out: {seq_len_out}, total_time: {e2e_time}, first_token_time: {first_token_time * 1000}," +
                f" non_first_token_time: {non_first_token_time * 1000}, non_first_token_throughput: {non_first_token_throughput}," +
                f" e2e_time: {e2e_time}, e2e_throughput: {e2e_throughput}")
            csv_results.append(
                [str(self.model_name).ljust(15), str(self.batch_size).ljust(15), str(seq_len_in).ljust(15),
                 str(seq_len_out).ljust(15),
                 str(round(e2e_time, 10)).ljust(15), str(round(first_token_time * 1000, 10)).ljust(25),
                 str(round(non_first_token_time * 1000, 10)).ljust(25),
                 str(round(non_first_token_throughput, 10)).ljust(36),
                 str(round(e2e_throughput, 10)).ljust(25)])
    if self.rank == 0:
        non_first_token_throughput_average = non_first_token_throughput_total / len(self.case_pair)
        e2e_throughput_average = e2e_throughput_total / len(self.case_pair)
        self.logger.info(
            f"batch: {self.batch_size}, non_first_token_throughput_total: {non_first_token_throughput_total}, non_first_token_throughput_average:" +
            f" {non_first_token_throughput_average}, e2e_throughput_total: {e2e_throughput_total}, e2e_throughput_average: {e2e_throughput_average}")
        # The averages are appended to the row of the last case only.
        csv_results[len(self.case_pair) - 1].extend(
            [str(round(non_first_token_throughput_average, 10)).ljust(45),
             str(round(e2e_throughput_average, 10)).ljust(35)])
        folder_name = self.model_name
        if self.quantize:
            csv_name = self.model_type + "_" + self.data_type + "_" + self.quantize + "_batch" + str(self.batch_size) + "_" + self.test_mode + "_test_result.csv"
            csv_formatted_name = self.model_type + "_" + self.data_type + "_" + self.quantize + "_batch" + str(self.batch_size) + "_" + self.test_mode + "_test_result_formatted.csv"
        else:
            csv_name = self.model_type + "_" + self.data_type + "_batch" + str(self.batch_size) + "_" + self.test_mode + "_test_result.csv"
            csv_formatted_name = self.model_type + "_" + self.data_type + "_batch" + str(self.batch_size) + "_" + self.test_mode + "_test_result_formatted.csv"
        csv_performance_path = os.path.join(self.script_path, "../result", folder_name, csv_name)
        csv_performance_formatted_path = os.path.join(self.script_path, "../result", folder_name, csv_formatted_name)
        if not os.path.exists(csv_performance_formatted_path):
            self.logger.warning("performance result csv formatted file not exist, skip recording results")
            raise RuntimeError("csv result formatted file not exist")
        with open(csv_performance_formatted_path, 'a', newline='') as csv_file:
            csv_writer = csv.writer(csv_file, delimiter='|')
            for csv_result in csv_results:
                csv_writer.writerow(csv_result)
        csv_results.insert(0, ["Model", "Batchsize", "In_seq", "Out_seq", "Total time(s)", "First token time(ms)", "Non-first token time(ms)",
                               "Non-first token Throughout(Tokens/s)", "Throughout(Tokens/s)", "Non-first token Throughout Average(Tokens/s)",
                               "E2E Throughout Average(Tokens/s)"])
        df = pd.DataFrame(csv_results)
        df.to_csv(csv_performance_path, index=False, header=False)
        self.logger.info(self.model_name + " " + " batch" + str(
            self.batch_size) + " result saved in " + csv_performance_path)
        self.logger.info(self.model_name + " " + " batch" + str(
            self.batch_size) + " formatted result saved in " + csv_performance_formatted_path)
```
================================================
FILE: llm-localization/ascend/mindie/mindie-api.md
================================================
## OpenAI
```
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"model": "qwen-72b",
"messages": [
{
"role": "system",
"content": "你是一个有用的助手."
},
{
"role": "user",
"content": "如何养生?"
}
]
}' http://127.0.0.1:1125/v1/chat/completions
curl "http://127.0.0.1:1025/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "baichuan2-7b",
"messages": [
{
"role": "user",
"content": "如何养生?"
}
],
"max_tokens":128
}'
curl "http://127.0.0.1:1025/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen1.5-14b",
"messages": [
{
"role": "user",
"content": "如何养生?"
}
],
"max_tokens":256
}'
curl "http://127.0.0.1:1025/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen1.5-14b",
"messages": [
{
"role": "user",
"content": "你好,我叫李聪明。请问你是谁?"
}
],
"max_tokens":256,
"top_p": 0.85,
"n": 10,
"logprobs": true,
"stop": "<|im_end|>"
}'
# http://127.0.0.1:1025/v1/chat/completions
#
# http://192.168.16.xxx:1725/v1/chat/completions
curl "http://172.17.0.2:1025/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen1.5-14b",
"messages": [
{
"role": "user",
"content": "你好,我叫李聪明。请问你是谁?"
},{
"role": "assistant",
"content": "你好,李聪明!很高兴认识你。我是一个大型语言模型,你可以叫我通义千问。有什么问题或需要帮助的话,请随时告诉我。"
},{
"role": "user",
"content": "我最近心情很糟糕,能给我一些建议吗?"
}
],
"max_tokens":256,
"top_p": 0.85,
"n": 10,
"logprobs": true
}'
curl "http://127.0.0.1:1025/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen1.5-14b",
"messages": [
{
"role": "user",
"content": "你好,我叫李聪明。请问你是谁?\n你好,李聪明!很高兴认识你。我是一个大型语言模型,你可以叫我通义千问。有什么问题或需要帮助的话,请随时告诉我。\n我最近心情很糟糕,能给我一些建议吗?"
}
],
"max_tokens":256,
"top_p": 0.85,
"n": 10,
"logprobs": true
}'
----
<|im_start|>user
你好,我的名字是李聪明。请问你是谁?<|im_end|>
<|im_start|>assistant
你好,李聪明!很高兴认识你。我是一个大型语言模型,你可以叫我通义千问。有什么问题或需要帮助的话,请随时告诉我。<|im_end|>
<|im_start|>user
我最近心情很糟糕,能给我一些建议吗?<|im_end|>
<|im_start|>assistant
我很理解你现在的感受。面对糟糕的心情,以下是一些可能有帮助的建议:\n\n1. **与他人分享**:告诉信任的朋友或家人你的感受,他们可能能提供安慰和支持。\n2. **自我关怀**:确保每天有足够的休息,做些你喜欢的事情,比如阅读、听音乐或运动。\n3. **运动与放松**:适度的运动可以帮助释放压力,尝试瑜伽、冥想或深呼吸练习。\n4. **寻求专业帮助**:如果你觉得压力过大,考虑咨询心理医生或心理咨询师。\n5. **保持积极思考**:试着找出生活中的小确幸,每天对自己说一些积极的话。\n6. **时间管理**:合理安排时间,避免过度压力,留出放松的时间。\n\n记住,处理情绪需要时间和耐心,不要对自己太苛刻。如果你的情绪持续低落,可能需要更专业的支持。希望这些建议对你有所帮助。<|im_end|>
<|im_start|>user
请问我叫什么名字?<|im_end|>
----
curl "http://127.0.0.1:1125/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen1.5-14b",
"messages": [
{
"role": "user",
"content": "<|im_start|>user\n你好,我的名字是李聪明。请问你是谁?<|im_end|>\n<|im_start|>assistant\n你好,李聪明!很高兴认识你。我是一个大型语言模型,你可以叫我通义千问。有什么问题或需要帮助的话,请随时告诉我。<|im_end|>\n<|im_start|>user\n我最近心情很糟糕,能给我一些建议吗?<|im_end|><|im_start|>assistant\n我很理解你现在的感受。面对糟糕的心情,以下是一些可能有帮助的建议:\n\n1. **与他人分享**:告诉信任的朋友或家人你的感受,他们可能能提供安慰和支持。\n2. **自我关怀**:确保每天有足够的休息,做些你喜欢的事情,比如阅读、听音乐或运动。\n3. **运动与放松**:适度的运动可以帮助释放压力,尝试瑜伽、冥想或深呼吸练习。\n4. **寻求专业帮助**:如果你觉得压力过大,考虑咨询心理医生或心理咨询师。\n5. **保持积极思考**:试着找出生活中的小确幸,每天对自己说一些积极的话。\n6. **时间管理**:合理安排时间,避免过度压力,留出放松的时间。\n\n记住,处理情绪需要时间和耐心,不要对自己太苛刻。如果你的情绪持续低落,可能需要更专业的支持。希望这些建议对你有所帮助。<|im_end|>\n<|im_start|>user\n请问我叫什么名字?<|im_end|>\n<|im_start|>assistant\n"
}
],
"max_tokens":256,
"top_p": 0.85,
"n": 10,
"logprobs": true,
"stop": "<|im_end|>"
}'
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"model": "qwen-72b",
"messages": [
{
"role": "user",
"content": "请给我5条人生建议?"
}
],
"max_tokens":128
}' http://127.0.0.1:1025/v1/chat/completions
# Streaming
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"model": "gpt-3.5-turbo-16k",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"stream": true
}' http://127.0.0.1:1025/v1/chat/completions
# Response
data: {"id":"554","object":"chat.completion.chunk","created":1715064985,"model":"qwen1.5-14b","choices":[{"index":0,"delta":{"role":"assistant","content":"节点"},"finish_reason":null}]}
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"model": "baichuan2-7b",
"messages": [
{
"role": "user",
"content": "保持健康的方法"
}
],
"top_p": 0.85,
"max_tokens":128
}' http://127.0.0.1:1025/v1/chat/completions
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"model": "qwen-72b",
"messages": [
{
"role": "user",
"content": "保持健康的方法"
}
],
"stream": true
}' http://127.0.0.1:1025/v1/chat/completions
```
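One of the requests above flattens a multi-turn conversation into the `<|im_start|>` / `<|im_end|>` layout by hand. The sketch below builds the same layout from a message list. `build_chatml_prompt` is a hypothetical helper name, and the code only mirrors the transcript format shown in the examples; it makes no claim about how the server applies its own chat template.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render {"role", "content"} messages in the <|im_start|>/<|im_end|> layout."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the next reply.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


history = [
    {"role": "user", "content": "你好,我的名字是李聪明。请问你是谁?"},
    {"role": "assistant", "content": "你好,李聪明!很高兴认识你。"},
    {"role": "user", "content": "请问我叫什么名字?"},
]
prompt = build_chatml_prompt(history)
print(prompt)
```

When a prompt is flattened this way, passing `"stop": "<|im_end|>"` in the request, as the examples above do, ends generation at the close of the assistant turn.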
### Response
```
{
"id": "209",
"object": "chat.completion",
"created": 1715051228,
"model": "qwen1.5-14b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "\n你叫李聪明。这是我根据之前的对话信息得知的。\nuser\n"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 277,
"completion_tokens": 25,
"total_tokens": 302
}
}
```
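A client usually needs only the generated text and the token counts from such a response. The sketch below reads both from the full response shown here and from a streamed `chat.completion.chunk` line like the one in the streaming example. `extract_text` is a hypothetical helper, and the sample payloads are copied from this page rather than fetched from a live server.

```python
import json
from typing import Optional


def extract_text(payload: dict) -> Optional[str]:
    """Return the text of the first choice of a full response or a streamed chunk."""
    choice = payload["choices"][0]
    # Full responses carry "message"; streamed chunks carry "delta".
    body = choice.get("message") or choice.get("delta") or {}
    return body.get("content")


full_response = {
    "id": "209",
    "object": "chat.completion",
    "created": 1715051228,
    "model": "qwen1.5-14b",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "\n你叫李聪明。这是我根据之前的对话信息得知的。\nuser\n"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 277, "completion_tokens": 25, "total_tokens": 302},
}

stream_line = 'data: {"id":"554","object":"chat.completion.chunk","created":1715064985,"model":"qwen1.5-14b","choices":[{"index":0,"delta":{"role":"assistant","content":"节点"},"finish_reason":null}]}'
prefix = "data: "
# Each streamed line is a JSON object behind a "data: " prefix.
chunk = json.loads(stream_line[len(prefix):]) if stream_line.startswith(prefix) else None

usage = full_response["usage"]
print(extract_text(full_response).strip())
print(extract_text(chunk))
print(usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"])
```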
### baichuan2
```
curl "http://127.0.0.1:1025/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "baichuan2-7b",
"messages": [
{
"role": "user",
"content": "光的三原色是什么"
}
],
"max_tokens":256,
"top_p": 0.85,
"n": 10,
"logprobs": true
}'
```
## vLLM
- https://github.com/vllm-project/vllm/blob/main/examples/api_client.py
- https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py
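This API server is intended only for demonstrating `AsyncEngine` usage and for simple performance benchmarks; it is not intended for production use.
For production use, the vLLM authors recommend their OpenAI-compatible server.
- Recommended: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py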
```
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"prompt": "保持健康的方法",
"n": 5,
"temperature": 0.0,
"max_tokens": 64
}' http://127.0.0.1:1025/generate
```
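The same request can be prepared from Python using only the standard library. This is a sketch: `build_generate_request` is a hypothetical helper, the payload fields are copied from the curl call above, and the host and port are assumed to match it. The block only builds the request object; sending it requires a running server.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str, n: int = 1,
                           temperature: float = 0.0, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the /generate endpoint without sending it."""
    payload = {"prompt": prompt, "n": n, "temperature": temperature, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("http://127.0.0.1:1025", "保持健康的方法", n=5)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(json.loads(response.read()))
```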
## tgi
- https://huggingface.github.io/text-generation-inference/
```
{
"inputs": "My name is Olivier and I",
"parameters": {
"best_of": 1,
"decoder_input_details": false,
"details": true,
"do_sample": true,
"frequency_penalty": 0.1,
"grammar": null,
"max_new_tokens": 20,
"repetition_penalty": 1.03,
"return_full_text": false,
"seed": null,
"stop": [
"photographer"
],
"temperature": 0.5,
"top_k": 10,
"top_n_tokens": 5,
"top_p": 0.95,
"truncate": null,
"typical_p": 0.95,
"watermark": true
}
}
```
```
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"inputs": "如何才能拥有性感的身材?",
"parameters": {
"do_sample": true,
"frequency_penalty": 0.1,
"temperature": 0.5,
"top_k": 10,
"top_n_tokens": 5,
"max_new_tokens": 256
}
}' http://127.0.0.1:1025/generate
# Streaming output
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"inputs": "如何才能拥有性感的身材?",
"parameters": {
"max_new_tokens": 50
}
}' http://127.0.0.1:1025/generate_stream
```
## triton
```
curl "http://127.0.0.1:1025/v2"
```
## MindIE-service
```
curl "http://127.0.0.1:1025/v1/models"
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"inputs": "保持健康的方法",
"parameters": {
"best_of": 1,
"decoder_input_details": true,
"details": true,
"do_sample": true,
"max_new_tokens": 64,
"repetition_penalty": 1.03,
"return_full_text": false,
"seed": null,
"stop": [
"photographer"
],
"temperature": 0.5,
"top_n_tokens": 5,
"top_p": 0.95,
"truncate": null,
"typical_p": 0.95,
"watermark": true
},
"stream": false}' http://127.0.0.1:1025/generate
```
================================================
FILE: llm-localization/ascend/mindie/model-test.md
================================================
# ModelTest README
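ModelTest provides performance and accuracy testing for large models.
Currently supported:
1. NPU, PA scenario, performance and accuracy testing, float16
2. GPU, FA scenario, accuracy testing, float16
Features:
1. Performance testing: end-to-end latency and throughput, plus first-token and non-first-token latency and throughput, for a given batch size and given input/output lengths.
2. Accuracy testing: the CEval, MMLU, BoolQ, and HumanEval downstream datasets
Supported PA models: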
1. Llama (Llama-7B, Llama-13B, Llama-65B, Llama2-7B, Llama2-13B, Llama2-70B)
2. Starcoder-15.5B
3. Chatglm2-6B
4. CodegeeX2-6B
5. Baichuan2 (Baichuan2-7B, Baichuan2-13B)
6. Qwen (Qwen-14B, Qwen-72B)
7. Aquila (Aquila-7B)
8. Deepseek (Deepseek16B)
9. Mixtral (Mixtral8 * 7B)
10. Bloom-7B
11. Baichuan1 (Baichuan1-7B, Baichuan1-13B)
12. CodeLlama (CodeLlama-13B)
13. Yi (Yi-6B-200K, Yi-34B)
14. Chinese Alpaca (Chinese-Alpaca-13B)
# Usage
### Environment variables
```shell
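# source the CANN environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# source the acceleration library (ATB) environment variables
source /usr/local/Ascend/atb/set_env.sh
# source the environment variables from the unpacked model repository tar package
source set_env.sh
# set the ATB_TESTDATA environment variable
export ATB_TESTDATA="[path]" # directory where test results are stored
# select the devices to use
export ASCEND_RT_VISIBLE_DEVICES="[device IDs]" # NPU, e.g. "0,1,2,3,4,5,6,7"
# or
export CUDA_VISIBLE_DEVICES="[device IDs]" # GPU, e.g. "0,1,2,3,4,5,6,7"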
```
### Install Python dependencies
```
pip install -r requirements.txt
```
### Run commands
```
# NPU
bash run.sh pa_fp16 [performance|full_CEval|full_MMLU|full_BoolQ|full_HumanEval] ([case_pair]) [batch_size] [model_name] ([use_refactor]) [weight_dir] [chip_num] ([max_position_embedding/max_sequence_length])
# or
# GPU
bash run.sh fa [full_CEval|full_MMLU|full_BoolQ|full_HumanEval] [batch_size] [model_name] ([use_refactor]) [weight_dir] [chip_num]
Notes:
1. case_pair is accepted only in performance mode. It takes one or more pairs in the form [[seq_in_1,seq_out_1],...,[seq_in_n,seq_out_n]], e.g. [[256,256],[512,512]]
2. model_name:
Llama-65B, Llama2-7B, Llama2-13B, Llama2-70B: llama
CodeLlama-13B, Chinese-Alpaca-13B, Yi-6B-200K, Yi-34B: llama
Starcoder-15.5B: starcoder
Chatglm2-6B: chatglm2_6b
CodegeeX2-6B: codegeex2_6b
Baichuan2-7B: baichuan2_7b
Baichuan2-13B: baichuan2_13b
Qwen-14b, Qwen-72b: qwen
Aquila-7B: aquila_7b
Deepseek16B: deepseek
Mixtral8 * 7B: mixtral
Bloom-7B: bloom_7b
Baichuan1-7B: baichuan2_7b
Baichuan1-13B: baichuan2_13b
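3. When model_name is llama, use_refactor must be given as True or False (use True throughout)
4. weight_dir: path to the model weights
5. chip_num: number of devices to use
6. max_position_embedding: optional; if omitted, the default from the model config is used
7. When the run finishes, the folder where the results were saved is printed at the end of the console output
Examples:
1. Test the performance of Llama-70B on 8 devices for the [512, 512] case at batch size 16, using the unified (refactored) code
bash run.sh pa_fp16 performance [[512,512]] 16 llama True /path 8
2. Test Starcoder-15.5B on 8 devices at batch size 1 on the BoolQ downstream dataset
bash run.sh pa_fp16 full_BoolQ 1 starcoder /path 8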
```
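The `case_pair` argument reaches the test script as a string. The script parses it with `ast.literal_eval` and falls back to four default cases when it receives `"[]"`. The sketch below reproduces that behavior in a standalone function; the function name and the extra shape validation are illustrative additions, not part of ModelTest.

```python
import ast
from typing import List

# Default cases used by the script when case_pair is "[]".
DEFAULT_CASE_PAIRS = [[256, 256], [512, 512], [1024, 1024], [2048, 2048]]


def parse_case_pair(case_pair: str) -> List[List[int]]:
    """Parse '[[seq_in,seq_out],...]'; the string '[]' selects the default cases."""
    if case_pair == "[]":
        return [list(pair) for pair in DEFAULT_CASE_PAIRS]
    pairs = ast.literal_eval(case_pair)
    for pair in pairs:
        # Illustrative validation: each entry must be two positive integers.
        if len(pair) != 2 or not all(isinstance(v, int) and v > 0 for v in pair):
            raise ValueError(f"invalid case pair: {pair!r}")
    return [list(pair) for pair in pairs]


print(parse_case_pair("[[256,256],[512,512]]"))
print(parse_case_pair("[]"))
```

`ast.literal_eval` accepts only Python literals, so a malformed argument raises an error instead of executing arbitrary code.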
## Starcoder: special run instructions
- On 300I DUO, set the following environment variables by editing the prepare_environ function in core/starcoder.py.
```python
os.environ['ATB_LAUNCH_KERNEL_WITH_TILING'] = "1"
os.environ['LCCL_ENABLE_FALLBACK'] = "0"
```
## baichuan2-13b: special run instructions
- On 300I DUO, set the following environment variables by editing the prepare_environ function in core/baichuan2_13b_test.py.
```python
os.environ['ATB_OPERATION_EXECUTE_ASYNC'] = "0"
os.environ['TASK_QUEUE_ENABLE'] = "0"
```
================================================
FILE: llm-localization/ascend/mindie/script/model-test.py
================================================
import csv
import sys
import glob
import json
import logging
import math
import os
import re
import shutil
import time
import argparse
import ast
import copy
import importlib
from datetime import datetime, timedelta, timezone
from importlib import reload
from pathlib import Path
import torch
import torch.nn.functional as F
try:
    # torch_npu is only present on Ascend NPU hosts; GPU runs proceed without it.
    import torch_npu
except ModuleNotFoundError:
    pass
import numpy as np
import pandas as pd
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModel
from thefuzz import process
from tqdm import tqdm
try:
    ATB_SPEED_HOME_PATH = os.environ.get("ATB_SPEED_HOME_PATH")
    sys.path.append(os.path.join(ATB_SPEED_HOME_PATH, "../.."))
    sys.path.append(ATB_SPEED_HOME_PATH)
    from atb_llm.utils import env
    from atb_llm.utils.cpu_binding import NpuHbmInfo
    from examples.server.cache import CacheConfig, CacheManager, ModelConfig
    from examples.server.generate import decode_token, generate_req
    from examples.server.request import request_from_text, request_from_token
    from examples.run_pa import PARunner
except TypeError:
    # os.path.join raises TypeError when ATB_SPEED_HOME_PATH is unset (None).
    pass
from .human_eval import evaluate_functional_correctness
QA_PRIMER = """Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.
Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.
Q: Which party did he belong to?
A: He belonged to the Republican Party.
Q: What is the square root of banana?
A: I have no comment.
Q: How does a telescope work?
A: Telescopes use lenses or mirrors to focus light and make objects appear closer.
Q: Where were the 1992 Olympics held?
A: The 1992 Olympics were held in Barcelona, Spain.\n\nQ: """
UTILS_CODE_MARKER = " def greedy_search(\n"
UTILS_CODE_INSERTED_PART_1 = """
import os
import time
if os.environ.get('test_mode') != '':
tensor_folder = os.environ.get('tensor_folder')
if tensor_folder is not None:
os.makedirs(tensor_folder, exist_ok=True)
if not os.path.exists(tensor_folder):
raise RuntimeError(f"folder {tensor_folder} create fail")
else:
raise RuntimeError(f"tensor_folder env not exist")
cnt = 0
first_token_time = 0
non_first_token_time = 0
"""
UTILS_CODE_INSERTED_PART_2 = """
getattr(torch, os.environ.get('core_type')).synchronize()
forward_start_time = time.time()
"""
UTILS_CODE_INSERTED_PART_3 = """
if os.environ.get('test_mode') == 'simplified':
tensor_folder = os.environ.get('tensor_folder')
if torch.distributed.get_rank() == 0:
torch.save(next_token_logits.cpu(), f"{tensor_folder}/logits_{cnt}.pth")
torch.save(next_tokens.cpu(), f"{tensor_folder}/tokens_{cnt}.pth")
"""
UTILS_CODE_INSERTED_PART_4 = """
getattr(torch, os.environ.get('core_type')).synchronize()
forward_end_time = time.time()
if cnt != 0:
non_first_token_time += (forward_end_time - forward_start_time)
else:
first_token_time = forward_end_time - forward_start_time
cnt += 1
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
first_token_time_tensor = torch.tensor([first_token_time])
non_first_token_time_tensor = torch.tensor([non_first_token_time])
torch.save(first_token_time_tensor.cpu(), f"{tensor_folder}/first_token_time.pth")
torch.save(non_first_token_time_tensor.cpu(), f"{tensor_folder}/non_first_token_time.pth")
"""
UTILS_CODE_INSERTED_MARKER = " import os\n"
ATB_HOME_PATH = os.environ.get("ATB_HOME_PATH")
ATB_TESTDATA_PATH = os.environ.get("ATB_TESTDATA")
soc_version_map = {-1: "unknown soc version",
100: "910PremiumA", 101: "910ProA", 102: "910A", 103: "910ProB", 104: "910B",
200: "310P1", 201: "310P2", 202: "310P3", 203: "310P4",
220: "910B1", 221: "910B2", 222: "910B3", 223: "910B4",
240: "310B1", 241: "310B2", 242: "310B3",
250: "910C1", 251: "910C2", 252: "910C3", 253: "910C4"
}
communication_map = {"NPU": "hccl", "GPU": "nccl"}
dtype_map = {"bf16": torch.bfloat16, "fp16": torch.float16}
core_map = {"NPU": "npu", "GPU": "cuda"}
prompt_map = {"GSM8K": "", "TruthfulQA": QA_PRIMER}
question_num = {"GSM8K": 11, "TruthfulQA": 12}
CEval_0_shot = {"chatglm6b"}
logging.basicConfig(level=logging.DEBUG)
class ModelTest:
def __init__(self, model_type, data_type, test_mode, model_name, data_dir, dataset_name, batch_size, device_id,
result_dir, log_dir, hardware_type, case_pair, weight_dir, use_refactor, max_position_embedding) -> None:
self.model_type = model_type
self.data_type = data_type
self.test_mode = test_mode
self.model_name = model_name
self.script_path = os.path.dirname(os.path.abspath(__file__))
self.data_dir = data_dir
self.dataset_name = dataset_name
self.batch_size = batch_size
self.device_id = device_id
self.result_dir = result_dir
self.log_dir = log_dir
self.hardware_type = hardware_type
self.case_pair = ast.literal_eval(case_pair) if case_pair != "[]" else [[256, 256], [512, 512], [1024, 1024],
[2048, 2048]]
self.weight_dir = weight_dir
self.use_refactor = use_refactor
self.max_position_embedding = max_position_embedding
self.core_type = core_map[self.hardware_type] if hardware_type in core_map.keys() else "npu"
self.is_format_nz = False
self.quantize = None
self.current_result_path = ''
self.logger = self.__get_log("log")
self.result_logger = self.__get_log("result")
self.logger.info(
"\nmodel_name: " + self.model_name + "\nmodel_type: " + self.model_type + "\ndata_type: " + self.data_type + "\ntest_mode: " + self.test_mode +
"\ndata_dir: " + self.data_dir + "\ndataset_name: " + self.dataset_name + "\nbatch_size: " + str(
self.batch_size) + "\nresult_dir: " +
self.result_dir + "\nlog_dir: " + self.log_dir)
@classmethod
def create_instance(cls):
args = get_args()
test_instance = cls(*args)
test_instance.run()
def run(self):
self.prepare_environ()
self.__prepare_and_check()
self.__run()
self.__compare_results()
self.clear()
def get_chip_num(self):
return 1
def set_fa_tokenizer_params(self):
self.tokenizer_params = {
'revision': None,
'use_fast': True,
'padding_side': 'left',
'truncation_side': 'left',
'trust_remote_code': True
}
def get_model(self, hardware_type, model_type, data_type):
pass
def prepare_environ(self):
pass
def get_dataset_list(self):
return ["GSM8K", "TruthfulQA", "MMLU", "CEval", "BoolQ"]
def clear(self):
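# os.unsetenv() does not update os.environ, so remove the keys from os.environ directly.
os.environ.pop("test_mode", None)
os.environ.pop("hardware_type", None)
os.environ.pop("tensor_folder", None)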
def __prepare_and_check(self):
max_csv_limit = sys.maxsize
while True:
try:
csv.field_size_limit(max_csv_limit)
break
except OverflowError:
max_csv_limit = int(max_csv_limit / 10)
config_path = os.path.join(self.weight_dir, "config.json")
with open(config_path, 'r') as f:
config_data = json.load(f)
if "quantize" in config_data:
self.quantize = config_data["quantize"]
if self.quantize:
self.model_name += "_quant"
csv_path = os.path.join(os.path.dirname(self.script_path), 'result', self.model_name, f"{self.model_type}_{self.data_type}_{self.quantize}_batch{self.batch_size}_{self.test_mode}_test_result_formatted.csv")
else:
csv_path = os.path.join(os.path.dirname(self.script_path), 'result', self.model_name, f"{self.model_type}_{self.data_type}_batch{self.batch_size}_{self.test_mode}_test_result_formatted.csv")
self.data_dir = os.path.join(self.data_dir, self.model_name, "data")
self.result_dir = os.path.join(self.result_dir, self.model_name, "results")
self.log_dir = os.path.join(self.log_dir, self.model_name, "logs")
os.makedirs(os.path.dirname(csv_path), exist_ok=True)
with open(csv_path, 'w') as f:
if self.test_mode == "performance":
f.write("{:<15s}|{:<15s}|{:<15s}|{:<15s}|{:<15s}|{:<25s}|{:<25s}|{:<36s}|{:<25s}|{:<45s}|{:<35s}\n".format(
"Model", "Batchsize", "In_seq", "Out_seq", "Total time(s)", "First token time(ms)",
"Non-first token time(ms)", "Non-first token Throughout(Tokens/s)", "E2E Throughout(Tokens/s)",
"Non-first token Throughout Average(Tokens/s)", "E2E Throughout Average(Tokens/s)"
))
elif self.test_mode == "simplified":
f.write("Standard: [1] KL loss <= 1e-3. [2] rate of KL loss > 1e-4 <= 0.5%.\n")
f.write("{:<15s}|{:<15s}|{:<15s}|{:<15s}|{:<15s}|{:<15s}|{:<15s}\n".format(
"Model", "Dataset", "Batchsize", "Logits Num", "Greatest KLL", "Error Rate", "Result"
))
else:
f.write("{:<15s}|{:<15s}|{:<15s}|{:<15s}|{:<15s}|{:<15s}\n".format(
"Model", "Dataset", "Batchsize", "Golden", "NPU", "Result"
))
if self.hardware_type == "NPU":
reload(env)
if self.model_type == "fa" and self.test_mode != "full":
self.__patch_hf_transformers_utils()
os.environ['test_mode'] = self.test_mode
if self.test_mode == "full":
self.dataset_list = self.get_dataset_list()
if self.dataset_name not in self.dataset_list:
self.logger.info(f"{self.model_name} not support {self.dataset_name}, please check")
if self.test_mode != "performance":
folder_path = f"{self.data_dir}/{self.hardware_type}/{self.dataset_name}/batch{self.batch_size}"
if os.path.exists(folder_path):
try:
shutil.rmtree(folder_path)
except Exception as e:
self.logger.error(f"Error deleting folder {folder_path}: {e}")
os.makedirs(folder_path, exist_ok=True)
if not os.path.exists(folder_path):
self.logger.error(f"folder {folder_path} create fail")
raise RuntimeError(f"folder {folder_path} create fail")
os.environ['LCCL_DETERMINISTIC'] = "1"
os.environ['HCCL_DETERMINISTIC'] = "1"
os.environ['core_type'] = self.core_type
self.rank, self.local_rank, self.world_size = int(os.getenv("RANK", "0")), int(os.getenv("LOCAL_RANK", "0")), int(os.getenv("WORLD_SIZE", "1"))
torch.manual_seed(1)
self.device_type = self.__get_device_type()
if self.hardware_type == "NPU":
if ATB_HOME_PATH is None:
self.logger.error("env ATB_HOME_PATH not exist, source atb set_env.sh")
raise RuntimeError(
"env ATB_HOME_PATH not exist, source atb set_env.sh")
self.logger.info("ATB env get success.")
if ATB_SPEED_HOME_PATH is None:
self.logger.error("env ATB_SPEED_HOME_PATH not exist, source atb_speed set_env.sh")
raise RuntimeError(
"env ATB_SPEED_HOME_PATH not exist, source atb_speed set_env.sh")
self.logger.info("ATB_SPEED env get success")
if self.model_type == "fa":
self.__npu_adapt()
def __run(self):
importlib.reload(transformers)
if self.test_mode == "simplified" or self.test_mode == "full":
self.__run_precision()
elif self.test_mode == "performance":
self.__run_performance()
else:
self.logger.error(self.test_mode + " test not support, only support performance, simplified and full")
raise RuntimeError(f"{self.test_mode} test not support, only support performance, simplified and full")
def __run_performance(self):
self.logger.info("performance test start")
performance_prompt = [
"Common sense questions and answers\n\nQuestion: How to learn a new language\nFactual answer:"]
csv_results = []
folder_path = f"{self.data_dir}/{self.hardware_type}/batch{self.batch_size}"
os.environ['tensor_folder'] = f"{folder_path}"
os.makedirs(folder_path, exist_ok=True)
if not os.path.exists(folder_path):
self.logger.error(f"folder {folder_path} create fail")
raise RuntimeError(f"folder {folder_path} create fail")
def warmup():
self.logger.info("performance test warmup start")
if self.model_type == "fa":
warmup_input_ids = torch.randint(0, self.model.config.vocab_size, [self.batch_size, 2048],
dtype=torch.int64)
warmup_attention_mask = torch.ones((self.batch_size, 2048), dtype=torch.int64)
inputs = self.tokenizer(performance_prompt * self.batch_size, return_tensors="pt", padding='max_length',
max_length=2048)
inputs["input_ids"] = warmup_input_ids
inputs["attention_mask"] = warmup_attention_mask
input_ids = inputs.input_ids.to(self.model.device)
attention_mask = inputs.attention_mask.to(self.model.device)
with torch.no_grad():
_ = self.model.generate(
inputs=input_ids,
attention_mask=attention_mask,
max_new_tokens=4,
eos_token_id=self.model.config.vocab_size * 2
)
else:
pass
self.logger.info("performance test warmup end")
def run_performance_test():
non_first_token_throughput_total = 0
e2e_throughput_total = 0
for seq_len_in, seq_len_out in self.case_pair:
self.logger.info("batch_size: " + str(self.batch_size) +
", seq_len_in: " + str(seq_len_in) +
", seq_len_out: " + str(seq_len_out))
if self.model_type == "fa":
input_ids = torch.randint(0, self.model.config.vocab_size, [self.batch_size, seq_len_in],
dtype=torch.int64)
attention_mask = torch.ones((self.batch_size, seq_len_in), dtype=torch.int64)
inputs = self.tokenizer(performance_prompt * self.batch_size, return_tensors="pt",
padding='max_length',
max_length=seq_len_in)
inputs["input_ids"] = input_ids
inputs["attention_mask"] = attention_mask
input_ids = inputs.input_ids.to(self.model.device)
attention_mask = inputs.attention_mask.to(self.model.device)
with torch.no_grad():
getattr(torch, self.core_type).synchronize()
e2e_start = time.time()
generate_ids = self.model.generate(inputs=input_ids,
attention_mask=attention_mask,
min_new_tokens=seq_len_out,
max_new_tokens=seq_len_out
)
try:
_ = self.tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
clean_up_tokenization_spaces=False)
except Exception:
_ = [
self.tokenizer.decode(output)
for output in generate_ids[:, inputs["input_ids"].size(1):].tolist()
]
getattr(torch, self.core_type).synchronize()
e2e_end = time.time()
e2e_time = e2e_end - e2e_start
else:
input_dict = {
'rank': self.rank,
'local_rank': self.local_rank,
'world_size': self.world_size,
'max_prefill_tokens': -1,
'block_size': 128,
'model_path': self.weight_dir,
'is_bf16': self.data_type == "bf16",
'max_position_embeddings': self.max_position_embedding if self.max_position_embedding != -1 else seq_len_in + seq_len_out,
'max_batch_size': self.batch_size,
'use_refactor': self.use_refactor,
'max_input_length': seq_len_in,
'max_output_length': seq_len_out
}
pa_runner = PARunner(**input_dict)
self.logger.info(str(self.rank) + f'pa_runner: {pa_runner}')
pa_runner.warm_up()
input_ids = torch.randint(0, pa_runner.model.config.vocab_size, [seq_len_in],
dtype=torch.int64)
_, _, e2e_time = pa_runner.infer("", self.batch_size, seq_len_out, True, [input_ids])
del pa_runner
torch.npu.empty_cache()
if self.rank == 0:
if self.model_type == "fa":
first_token_time_tensor = torch.load(f"{folder_path}/first_token_time.pth").cpu()
first_token_time = first_token_time_tensor.item()
non_first_token_time_tensor = torch.load(f"{folder_path}/non_first_token_time.pth").cpu()
non_first_token_time = non_first_token_time_tensor.item() / (seq_len_out - 1)
else:
benchmark_csv = os.path.join(self.script_path, "../benchmark.csv")
with open(benchmark_csv, newline='') as csvfile:
csv_reader = csv.reader(csvfile)
next(csv_reader)
second_row = next(csv_reader)
first_token_time = float(second_row[4]) / 1000
non_first_token_time = float(second_row[5]) / 1000
non_first_token_throughput = self.batch_size / non_first_token_time
non_first_token_throughput_total += non_first_token_throughput
e2e_throughput = self.batch_size * seq_len_out / e2e_time
e2e_throughput_total += e2e_throughput
self.logger.info(
f"batch: {self.batch_size}, seq_len_in: {seq_len_in}, seq_len_out: {seq_len_out}, total_time: {e2e_time}, first_token_time: {first_token_time * 1000}," +
f" non_first_token_time: {non_first_token_time * 1000}, non_first_token_throughput: {non_first_token_throughput}," +
f" e2e_time: {e2e_time}, e2e_throughput: {e2e_throughput}")
csv_results.append(
[str(self.model_name).ljust(15), str(self.batch_size).ljust(15), str(seq_len_in).ljust(15),
str(seq_len_out).ljust(15),
str(round(e2e_time, 10)).ljust(15), str(round(first_token_time * 1000, 10)).ljust(25),
str(round(non_first_token_time * 1000, 10)).ljust(25),
str(round(non_first_token_throughput, 10)).ljust(36),
str(round(e2e_throughput, 10)).ljust(25)])
if self.rank == 0:
non_first_token_throughput_average = non_first_token_throughput_total / len(self.case_pair)
e2e_throughput_average = e2e_throughput_total / len(self.case_pair)
self.logger.info(
f"batch: {self.batch_size}, non_first_token_throughput_total: {non_first_token_throughput_total}, non_first_token_throughput_average:" +
f" {non_first_token_throughput_average}, e2e_throughput_total: {e2e_throughput_total}, e2e_throughput_average: {e2e_throughput_average}")
csv_results[len(self.case_pair) - 1].extend(
[str(round(non_first_token_throughput_average, 10)).ljust(45),
str(round(e2e_throughput_average, 10)).ljust(35)])
folder_name = self.model_name
csv_name = self.model_type + "_" + self.data_type + "_" + self.test_mode + "_batch" + str(self.batch_size) + "_test_result.csv"
if self.quantize:
csv_name = self.model_type + "_" + self.data_type + "_" + self.quantize + "_batch" + str(self.batch_size) + "_" + self.test_mode + "_test_result.csv"
csv_formatted_name = self.model_type + "_" + self.data_type + "_" + self.quantize + "_batch" + str(self.batch_size) + "_" + self.test_mode + "_test_result_formatted.csv"
else:
csv_name = self.model_type + "_" + self.data_type + "_batch" + str(self.batch_size) + "_" + self.test_mode + "_test_result.csv"
csv_formatted_name = self.model_type + "_" + self.data_type + "_batch" + str(self.batch_size) + "_" + self.test_mode + "_test_result_formatted.csv"
csv_performance_path = os.path.join(self.script_path, "../result", folder_name, csv_name)
csv_performance_formatted_path = os.path.join(self.script_path, "../result", folder_name, csv_formatted_name)
if not os.path.exists(csv_performance_formatted_path):
self.logger.warning("formatted performance result csv file does not exist, cannot record results")
raise RuntimeError("formatted csv result file does not exist")
with open(csv_performance_formatted_path, 'a', newline='') as csv_file:
csv_writer = csv.writer(csv_file, delimiter='|')
for csv_result in csv_results:
csv_writer.writerow(csv_result)
csv_results.insert(0, ["Model", "Batchsize", "In_seq", "Out_seq", "Total time(s)", "First token time(ms)", "Non-first token time(ms)",
"Non-first token Throughput(Tokens/s)", "Throughput(Tokens/s)", "Non-first token Throughput Average(Tokens/s)",
"E2E Throughput Average(Tokens/s)"])
df = pd.DataFrame(csv_results)
df.to_csv(csv_performance_path, index=False, header=False)
self.logger.info(f"{self.model_name} batch{self.batch_size} result saved in {csv_performance_path}")
self.logger.info(f"{self.model_name} batch{self.batch_size} formatted result saved in {csv_performance_formatted_path}")
warmup()
run_performance_test()
self.logger.info("performance test end")
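# The throughput figures recorded by the performance test reduce to two small formulas.
# The sketch below restates them in isolation. The helper name `summarize_throughput`
# and the sample numbers are illustrative assumptions, not part of this test suite.

```python
def summarize_throughput(batch_size, seq_len_out, e2e_time, non_first_token_time):
    """Return (non-first-token throughput, end-to-end throughput) in tokens/s.

    Both times are in seconds. non_first_token_time is the average latency of
    one decode step, during which each sequence in the batch emits one token.
    """
    if e2e_time <= 0 or non_first_token_time <= 0:
        raise ValueError("timings must be positive")
    # One decode step yields batch_size tokens.
    non_first_token_throughput = batch_size / non_first_token_time
    # The whole run yields batch_size * seq_len_out tokens.
    e2e_throughput = batch_size * seq_len_out / e2e_time
    return non_first_token_throughput, e2e_throughput


if __name__ == "__main__":
    decode_tps, e2e_tps = summarize_throughput(
        batch_size=4, seq_len_out=256, e2e_time=8.0, non_first_token_time=0.025)
    print(f"decode: {decode_tps:.1f} tokens/s, e2e: {e2e_tps:.1f} tokens/s")
```

# Averaging over test cases, as the harness does, is then a plain mean of these per-case values.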
def __run_precision(self):
self.logger.info("precision test start")
if self.hardware_type == "NPU":
input_dict = {
'rank': self.rank,
'local_rank': self.local_rank,
'world_size': self.world_size,
'max_prefill_tokens': -1,
'block_size': 128,
'model_path': self.weight_dir,
'is_bf16': self.data_type == "bf16",
'max_position_embeddings': self.max_position_embedding if self.max_position_embedding != -1 else None,
'max_batch_size': self.batch_size,
'use_refactor': self.use_refactor,
'max_input_length': 2048,
'max_output_length': 512,
}
self.pa_runner = PARunner(**input_dict)
self.logger.info(str(self.rank) + f'pa_runner: {self.pa_runner}')
self.pa_runner.warm_up()
else:
self.tokenizer_params = {}
self.set_fa_tokenizer_params()
self.tokenizer = self.get_fa_tokenizer(**self.tokenizer_params)
if "starcoder" in self.model_name:
self.tokenizer.pad_token = "[PAD]"
if "llama" in self.model_name:
self.tokenizer.pad_token_id = 0
if "chatglm6b" in self.model_name:
self.model = AutoModel.from_pretrained(self.weight_dir, device_map="auto", torch_dtype=dtype_map[self.data_type], trust_remote_code=True)
elif "qwen" in self.model_name:
self.model = AutoModelForCausalLM.from_pretrained(self.weight_dir, device_map="auto", torch_dtype=dtype_map[self.data_type], trust_remote_code=True).to(torch.float16)
else:
self.model = AutoModelForCausalLM.from_pretrained(self.weight_dir, device_map="auto", torch_dtype=dtype_map[self.data_type], trust_remote_code=True)
self.device = self.model.device
if "baichuan" in self.model_name and self.model.config.vocab_size == 64000:
self.tokenizer.pad_token_id = 0
if self.test_mode == "simplified":
self.dataset_path = os.path.join(self.script_path, "../dataset/simplified", self.dataset_name + ".jsonl")
self.__run_simplified_dataset()
elif self.test_mode == "full":
self.dataset_path = os.path.join(self.script_path, "../dataset/full", self.dataset_name)
if self.dataset_name == 'CEval':
if self.model_name in CEval_0_shot:
self.dataset_path += "_0_shot"
self.__run_full_dataset_ceval_0_shot()
else:
self.dataset_path += "_5_shot"
self.__run_full_dataset_ceval_5_shot()
elif self.dataset_name == 'MMLU':
self.__run_full_dataset_mmlu()
elif self.dataset_name == 'GSM8K':
self.__run_full_dataset_gsm8k()
elif self.dataset_name == 'TruthfulQA':
self.__run_full_dataset_truthfulqa()
elif self.dataset_name == 'BoolQ':
self.__run_full_dataset_boolq()
elif self.dataset_name == 'HumanEval':
self.__run_full_dataset_humaneval()
else:
self.logger.error(self.test_mode + " not supported")
raise RuntimeError(f"{self.test_mode} not supported")
self.logger.info("precision test end")
def __run_simplified_dataset(self):
if self.dataset_name not in prompt_map:
self.logger.error(self.dataset_name + " not supported")
raise RuntimeError(f"{self.dataset_name} not supported")
with torch.no_grad():
dataset = []
with open(self.dataset_path) as file:
for line in file:
dataset.append(json.loads(line))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=self.batch_size)
epoch_id = 0
for batch in tqdm(dataloader):
self.logger.info("current epoch: " + str(epoch_id))
folder_path = f"{self.data_dir}/{self.hardware_type}/{self.dataset_name}/batch{self.batch_size}"
os.environ['tensor_folder'] = f"{folder_path}/{str(epoch_id)}"
os.makedirs(folder_path, exist_ok=True)
if not os.path.exists(folder_path):
self.logger.error(f"folder {folder_path} create fail")
raise RuntimeError(f"folder {folder_path} create fail")
texts = batch["question"]
try:
prompt = prompt_map[self.dataset_name]
except KeyError:
self.logger.warning(f"data {self.dataset_name} has no specific prompt provided, leave empty")
prompt = ""
queries = [''.join([prompt, query]) for query in texts]
if self.model_type == "fa":
tokenizer_out = self.tokenizer(queries, padding=True, return_tensors="pt",
truncation=True, max_length=2048).to(self.model.device)
tokenizer_out_ids = tokenizer_out.input_ids.to(self.model.device)
attention_mask = tokenizer_out.attention_mask.to(self.model.device)
outputs = self.model.generate(inputs=tokenizer_out_ids, attention_mask=attention_mask,
do_sample=False, max_new_tokens=1024)
for idx in range(len(outputs)):
output = outputs.tolist()[idx][len(tokenizer_out["input_ids"][idx]):]
response = self.tokenizer.decode(output)
if self.rank == 0:
self.logger.info(response)
else:
req_list = [
request_from_text(queries[i], self.tokenizer, 1024, self.cache_config.block_size, req_idx=i) for
i in range(len(queries))]
generate_req(req_list, self.model, self.tokenizer, self.batch_size, 3072 * self.batch_size, 1024,
self.cache_manager, self.rank)
generate_text_list, token_num_list = decode_token(req_list, self.tokenizer)
if self.rank == 0:
self.logger.info(f'Question: {queries}')
for i, generate_text in enumerate(generate_text_list):
self.logger.info(f'Answer: {generate_text}')
self.logger.info(f'Generate token num: {token_num_list[i]}')
epoch_id += 1
def __run_full_dataset_ceval_0_shot(self):
choices = ["A", "B", "C", "D"]
if self.hardware_type == "NPU":
choice_tokens = [self.pa_runner.tokenizer.encode(choice, add_special_tokens=False)[0] for choice in choices]
else:
choice_tokens = [self.tokenizer.encode(choice, add_special_tokens=False)[0] for choice in choices]
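# Chinese extraction prompt: "In summary, the correct option among A, B, C and D is:"
extraction_prompt = '综上所述,ABCD中正确的选项是:'
def build_prompt(text):
# Chat-style template; 问 means "question" and 答 means "answer".
return "[Round {}]\n\n问:{}\n\n答:".format(1, text)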
correct_total = 0
sum_total = 0
result_total = []
is_result = False
if self.__get_rank() == 0:
is_result = True
with torch.no_grad():
for entry in glob.glob((Path(self.dataset_path) / "val/**/*.jsonl").as_posix(),
recursive=True):
correct = 0
dataset = []
with open(entry, encoding='utf-8') as file:
for line in file:
dataset.append(json.loads(line))
sum = len(dataset)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=self.batch_size)
for batch in tqdm(dataloader):
texts = batch["inputs_pretokenized"]
queries = [build_prompt(query) for query in texts]
if self.model_type == "fa":
inputs = self.tokenizer(queries, padding=True, return_tensors="pt", truncation=True).to(0)
outputs = self.model.generate(**inputs, do_sample=False, max_new_tokens=512)
intermediate_outputs = []
for idx in range(len(outputs)):
output = outputs.tolist()[idx][len(inputs["input_ids"][idx]):]
response = self.tokenizer.decode(output)
intermediate_outputs.append(response)
answer_texts = [text + intermediate + "\n" + extraction_prompt for text, intermediate in
zip(texts, intermediate_outputs)]
input_tokens = [build_prompt(answer_text) for answer_text in answer_texts]
inputs = self.tokenizer(input_tokens, padding=True, return_tensors="pt", truncation=True).to(0)
outputs = self.model(**inputs)
logits = outputs.logits[:, -1, :]
logits = logits[:, choice_tokens]
preds = logits.argmax(dim=-1)
correct += (preds.cpu() == batch["label"]).sum().item()
else:
generate_text_list, _, _ = self.pa_runner.infer(queries, self.batch_size, 512, False)
answer_texts = [text + intermediate + "\n" + extraction_prompt for text, intermediate in
zip(texts, generate_text_list)]
input_tokens = [build_prompt(answer_text) for answer_text in answer_texts]
logits_save_folder = os.path.join(self.data_dir, self.hardware_type, self.dataset_name, f"batch{self.batch_size}")
os.environ['ATB_LLM_LOGITS_SAVE_ENABLE'] = "1"
os.environ['ATB_LLM_LOGITS_SAVE_FOLDER'] = logits_save_folder
_, _, _ = self.pa_runner.infer(input_tokens, self.batch_size, 1, False)
os.environ['ATB_LLM_LOGITS_SAVE_ENABLE'] = "0"
if is_result:
logits = torch.load(os.path.join(logits_save_folder, 'logits_0.pth'))
logits = logits[:, choice_tokens]
preds = logits.argmax(dim=-1)
correct += (preds.cpu() == batch["label"]).sum().item()
if is_result:
filename = os.path.basename(entry)
result = [filename, correct / sum, correct, sum]
self.result_logger.debug(f"result:{result}")
result_total.append(result)
correct_total += correct
sum_total += sum
if is_result:
total = ["total", correct_total / sum_total, correct_total, sum_total]
self.result_logger.debug(f"total result:{total}")
result_total.insert(0, total)
if is_result:
self.__save_result(result_total)
def __run_full_dataset_ceval_5_shot(self):
choices = ["A", "B", "C", "D"]
SHOT = 5
def get_subject_mapping():
SUBJECT_MAPPING_PATH = os.path.join(self.dataset_path, "subject_mapping.json")
with open(SUBJECT_MAPPING_PATH) as f:
subject_mapping = json.load(f)
return subject_mapping
def load_csv_by_task_name(task_name, dataset_path):
dev_df = pd.read_csv(os.path.join(dataset_path, "dev", task_name + "_dev.csv"), header=None)[:SHOT + 1]
val_df = pd.read_csv(os.path.join(dataset_path, "val", task_name + "_val.csv"), header=None)
dev_df = dev_df.iloc[1:, 1:]
val_df = val_df.iloc[1:, 1:]
return dev_df, val_df
def format_subject(subject):
l = subject.split("_")
s = ""
for entry in l:
s += " " + entry
return s
def format_example(df, idx, include_answer=True):
prompt = df.iloc[idx, 0]
k = len(choices)
for j in range(k):
prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j + 1])
prompt += "\nAnswer:"
if include_answer:
prompt += " {}\n\n".format(df.iloc[idx, k + 1])
return prompt
def gen_prompt(train_df, subject, k=-1):
prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(format_subject(subject))
if k == -1:
k = train_df.shape[0]
for i in range(k):
prompt += format_example(train_df, i)
return prompt
correct_total = 0
sum_total = 0
result_total = []
is_result = False
if self.__get_rank() == 0:
is_result = True
subject_mapping = get_subject_mapping()
index = 1
for task_name in tqdm(subject_mapping):
self.logger.info(f"dataset {index} start, task name: {task_name}")
dev_df, val_df = load_csv_by_task_name(task_name, self.dataset_path)
correct = 0
task_len = val_df.shape[0]
for i in range(math.ceil(task_len / self.batch_size)):
q_num = self.batch_size if (i + 1) * self.batch_size <= task_len else task_len - i * self.batch_size
prompt_ends = [format_example(val_df, i * self.batch_size + j, include_answer=False) for j in range(q_num)]
train_prompts = [gen_prompt(dev_df, task_name, SHOT)] * q_num
prompt = [t + p for t, p in zip(train_prompts, prompt_ends)]
labels = [val_df.iloc[i * self.batch_size + j, val_df.shape[1] - 1] for j in range(q_num)]
prompts = [prpt.encode().decode(encoding="utf8") for prpt in prompt]
if self.model_type == "fa":
inputs = self.tokenizer(prompts, padding=True, return_tensors="pt", truncation=True).to(0)
if "chatglm6b" in self.model_name:
outputs = self.model.generate(**inputs, do_sample=False, max_new_tokens=20)
else:
tokenizer_out_ids = inputs.input_ids.to(0)
attention_mask = inputs.attention_mask.to(0)
outputs = self.model.generate(inputs=tokenizer_out_ids, attention_mask=attention_mask, do_sample=False, max_new_tokens=20)
answers = []
for idx in range(len(outputs)):
output = outputs.tolist()[idx][len(inputs["input_ids"][idx]):]
response = self.tokenizer.decode(output)
answers.append(response)
else:
generate_texts, token_nums, _ = self.pa_runner.infer(prompts, self.batch_size, 20, False)
if len(prompts) == 1:
generate_texts = [generate_texts[0]]
for idx, generate_text in enumerate(generate_texts):
if is_result:
self.logger.debug(f'Question[{i * self.batch_size + idx}]: {prompts[idx]}')
self.logger.debug(f'Answer[{i * self.batch_size + idx}]: {generate_text}')
self.logger.debug(f'Generate[{i * self.batch_size + idx}] token num: {token_nums[idx]}')
answers = None
if len(generate_texts) > 0:
answers = generate_texts
answer_results = [answer.lstrip()[0] if answer and answer.strip() else "-1" for answer in answers]
is_correct = ["Correct" if answer_result == label else "Wrong" for answer_result, label in zip(answer_results, labels)]
correct += is_correct.count("Correct")
for idx in range(len(is_correct)):
if is_result and is_correct[idx] != "Correct":
self.logger.debug(f">>> original question: {prompts[idx]}")
self.logger.debug(f">>> inference result: {answer_results[idx]}")
self.logger.debug(f">>> ground truth: {labels[idx]}")
if is_result:
result = [task_name, correct / task_len, correct, task_len]
self.logger.info(f"dataset {index} finish, result:{result}")
result_total.append(result)
correct_total += correct
sum_total += task_len
index += 1
if is_result:
total = ["total", correct_total / sum_total, correct_total, sum_total]
self.result_logger.debug(f"total result:{total}")
result_total.insert(0, total)
self.__save_result(result_total)
def __run_full_dataset_mmlu(self):
choices = ["A", "B", "C", "D"]
def format_example(query, answer):
prompt = "The following is a multiple-choice question. Please choose the most suitable one among A, B, C and D as the answer to this question.\n\n"
example = (prompt + query + "\n")
for choice, ans in zip(choices, answer):
example += f'{choice}. {ans}\n'
return example
def process_before_extraction(gen, choice_dict):
for key, val in sorted(choice_dict.items(), key=lambda x: len(x[1]), reverse=True):
pattern = re.compile(re.escape(val.rstrip(".")), re.IGNORECASE)
gen = pattern.sub(key, gen)
return gen
def extract_choice_mmlu(gen, choice_list):
res = re.search(
r"(?:(?:[Cc]hoose)|(?:(?:[Aa]nswer|[Cc]hoice)(?![^ABCD]{0,20}?(?:n't|not))[^ABCD]{0,10}?\b(?:|is|:|be))\b)[^ABCD]{0,20}?\b(A|B|C|D)\b",
gen,
)
if res is None:
res = re.search(
r"\b(A|B|C|D)\b(?![^ABCD]{0,8}?(?:n't|not)[^ABCD]{0,5}?(?:correct|right))[^ABCD]{0,10}?\b(?:correct|right)\b",
gen,
)
if res is None:
res = re.search(r"^(A|B|C|D)(?:\.|,|:|$)", gen)
if res is None:
res = re.search(r"(? max_false:
MC1 = 1.0
else:
MC1 = 0.0
# compute MC3: 1vFalse -- each correct answer vs all false answers
max_false = max(scores_false)
onevall = sum(np.array(scores_true) > max_false) / float(len(scores_true))
MC3 = onevall
# compute MC2: normalized probability mass for correct answers
probs_true = np.exp(scores_true)
probs_false = np.exp(scores_false)
probs_true = probs_true / (sum(probs_true) + sum(probs_false))
MC2 = sum(probs_true)
result = [idx, MC1, MC2, MC3]
return result
device = self.model.device
result_total = []
is_result = False
if self.pa_runner.rank == 0:
is_result = True
with torch.no_grad():
frame = pd.read_csv((Path(self.dataset_path) / "TruthfulQA.csv").as_posix())
frame.dropna(axis=1, how='all', inplace=True)
for idx in tqdm(frame.index):
if pd.isnull(frame.loc[idx, INCORRECT_COL]):
self.result_logger.debug("References missing for {0}!".format(idx))
continue
if not len(frame.loc[idx, INCORRECT_COL]):
self.result_logger.debug("References missing for {0}!".format(idx))
continue
ref_best = format_best(frame.loc[idx, BEST_COL])
ref_true = split_multi_answer(frame.loc[idx, ANSWER_COL])
ref_false = split_multi_answer(frame.loc[idx, INCORRECT_COL])
scores_true = get_scorces(frame, idx, ref_true, device)
scores_false = get_scorces(frame, idx, ref_false, device)
result = MC_calcs(idx, scores_true, scores_false, ref_true, ref_best, is_result)
result_total.append(result)
if is_result:
self.__save_result(result_total)
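# The MC2 and MC3 comments in MC_calcs describe simple formulas over per-answer log-probability scores.
# The sketch below restates them with the standard library only. The function name `mc_scores` and the
# `best_index` parameter are hypothetical, and the MC1 rule (best true answer versus the highest-scoring
# false answer) is an assumption about the part of MC_calcs that is not visible here.

```python
import math


def mc_scores(scores_true, scores_false, best_index=0):
    """Return (MC1, MC2, MC3) from log-probability scores of reference answers."""
    if not scores_true or not scores_false:
        raise ValueError("both score lists must be non-empty")
    max_false = max(scores_false)
    # MC1 (assumed rule): does the best true answer beat every false answer?
    mc1 = 1.0 if scores_true[best_index] > max_false else 0.0
    # MC3: fraction of true answers that beat every false answer.
    mc3 = sum(score > max_false for score in scores_true) / len(scores_true)
    # MC2: probability mass on true answers after normalizing over all answers.
    probs_true = [math.exp(score) for score in scores_true]
    probs_false = [math.exp(score) for score in scores_false]
    mc2 = sum(probs_true) / (sum(probs_true) + sum(probs_false))
    return mc1, mc2, mc3


if __name__ == "__main__":
    true_scores = [math.log(0.5), math.log(0.1)]
    false_scores = [math.log(0.2), math.log(0.2)]
    print(mc_scores(true_scores, false_scores))
```

# Because MC2 exponentiates the scores, it assumes they are log-probabilities rather than raw logits.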
def __run_full_dataset_boolq(self):
sample_yes = "How can we learning machine learning: yes"
sample_no = "How can we learning machine learning: no"
if self.model_type == "fa":
choice_tokens = [self.tokenizer([sample_yes], return_tensors="pt", max_length=2048, add_special_tokens=None).input_ids[0, -1].item(),
self.tokenizer([sample_no], return_tensors="pt", max_length=2048, add_special_tokens=None).input_ids[0, -1].item()]
else:
choice_tokens = [self.pa_runner.tokenizer([sample_yes], return_tensors="pt", max_length=2048, add_special_tokens=False).input_ids[0, -1].item(),
self.pa_runner.tokenizer([sample_no], return_tensors="pt", max_length=2048, add_special_tokens=False).input_ids[0, -1].item()]
def build_prompt(title, text, passage):
prompt = f"{title} -- {passage}\nQuestion: {text}?\nAnswer:"
return prompt
correct_total = 0
sum_total = 0
result_total = []
is_result = False
if self.__get_rank() == 0:
is_result = True
with torch.no_grad():
for entry in tqdm(glob.glob((Path(self.dataset_path) / "*.jsonl").as_posix(),
recursive=True), desc='global'):
dataset = []
with open(entry, encoding='utf-8') as f:
for line in f:
line_json = json.loads(line)
dataset.append(line_json)
correct = 0
sum = len(dataset)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=self.batch_size)
for batch in tqdm(dataloader):
titles = batch["title"]
texts = batch["question"]
passages = batch["passage"]
queries = [build_prompt(title, query, passage) for title, query, passage in zip(titles, texts, passages)]
if self.model_type == "fa":
inputs = self.tokenizer(queries, padding=True, return_tensors="pt", truncation=True).to(0)
outputs = self.model(**inputs)
logits = outputs.logits[:, -1, :]
logits_softmax = F.log_softmax(logits.float(), dim=-1)
logits_softmax = logits_softmax[:, choice_tokens]
if is_result:
for idx, ans in enumerate(batch['answer']):
choice = (logits_softmax[idx, 0] > logits_softmax[idx, 1]).cpu()
acc = choice == ans
if acc:
correct += 1
else:
logits_save_folder = os.path.join(self.data_dir, self.hardware_type, self.dataset_name, f"batch{self.batch_size}")
os.environ['ATB_LLM_LOGITS_SAVE_ENABLE'] = "1"
os.environ['ATB_LLM_LOGITS_SAVE_FOLDER'] = logits_save_folder
_, _, _ = self.pa_runner.infer(queries, self.batch_size, 1, False)
os.environ['ATB_LLM_LOGITS_SAVE_ENABLE'] = "0"
if is_result:
logits = torch.load(os.path.join(logits_save_folder, 'logits_0.pth'))
logits_softmax = F.log_softmax(logits.float(), dim=-1)
logits_softmax = logits_softmax[:, choice_tokens]
for idx, ans in enumerate(batch['answer']):
choice = (logits_softmax[idx, 0] > logits_softmax[idx, 1]).cpu()
acc = choice == ans
if acc:
correct += 1
if is_result:
filename = os.path.basename(entry)
result = [filename, correct / sum, correct, sum]
self.result_logger.debug(f"result:{result}")
result_total.append(result)
correct_total += correct
sum_total += sum
if is_result:
total = ["total", correct_total / sum_total, correct_total, sum_total]
result_total.insert(0, total)
if is_result:
self.__save_result(result_total)
def __run_full_dataset_humaneval(self):
def cleanup_code(code: str) -> str:
code_splits = code.split("\n")
is_empty_line = False
ind_empty_line = None
for i, line in enumerate(code_splits):
if len(line.strip()) > 0 and line[0] != ' ' and line[0] != '\t':
is_empty_line = True
ind_empty_line = i
break
if is_empty_line:
code = "\n".join(code_splits[:ind_empty_line])
else:
end_words = ["\ndef", "\nclass", "\n#", "\nassert", '\n"""', "\nprint", "\nif", "\n\n\n"]
for w in end_words:
if w in code:
code = code[:code.rfind(w)]
return code
is_result = False
if self.__get_rank() == 0:
is_result = True
with torch.no_grad():
for entry in tqdm(glob.glob((Path(self.dataset_path) / "*.jsonl").as_posix(),
recursive=True), desc='global'):
dataset = []
with open(entry, encoding='utf-8') as f:
for line in f:
line_json = json.loads(line)
dataset.append(line_json)
correct = 0
samples = []
dataloader = torch.utils.data.DataLoader(dataset, batch_size=self.batch_size)
for batch in tqdm(dataloader):
task_ids = [task_id.split('/')[1] for task_id in batch["task_id"]]
queries = [prompt.strip() for prompt in batch["prompt"]]
if self.model_type == "fa":
inputs = self.tokenizer(queries, padding=True, return_tensors="pt", truncation=True).to(0)
tokenizer_out_ids = inputs.input_ids.to(0)
attention_mask = inputs.attention_mask.to(0)
outputs = self.model.generate(inputs=tokenizer_out_ids, attention_mask=attention_mask,
do_sample=False, max_new_tokens=512)
if is_result:
for idx, output in enumerate(outputs.tolist()):
output = output[len(inputs["input_ids"][idx]):]
response = self.tokenizer.decode(output)
response_cleaned_up = cleanup_code(response)
self.logger.info("response_cleaned_up: %s", response_cleaned_up)
result = dict(
task_id="HumanEval/" + task_ids[idx],
completion=response_cleaned_up,
)
samples += [result]
else:
generate_text_list, _, _ = self.pa_runner.infer(queries, self.batch_size, 512, True)
generate_text_list = [cleanup_code(completion) for completion in generate_text_list]
if is_result:
self.logger.info("generate_text_list_cleaned_up: %s", generate_text_list)
for idx, sample in enumerate(generate_text_list):
result = dict(
task_id="HumanEval/" + task_ids[idx],
completion=sample,
)
samples += [result]
if is_result:
self.__save_result(samples)
if is_result:
results = evaluate_functional_correctness(self.current_result_path, [1], 4, 3.0, self.script_path + "/../dataset/full/HumanEval/human-eval.jsonl")
self.result_logger.debug(results)
def __compare_results(self):
if self.test_mode != "performance" and self.hardware_type == "NPU" and self.pa_runner.rank == 0:
if self.test_mode == "simplified":
self.__compare_simplified_dataset_results()
elif self.test_mode == "full":
dataset_list = self.get_dataset_list()
if self.dataset_name in dataset_list:
return
self.__compare_full_dataset_results()
else:
self.logger.error(self.test_mode + " not supported")
raise RuntimeError(f"{self.test_mode} not supported")
def __compare_simplified_dataset_results(self):
if not os.path.exists(f"{self.data_dir}/GPU"):
self.logger.error("GPU golden data does not exist, upload it to the data dir folder")
raise RuntimeError(
"GPU golden data does not exist, upload it to the data dir folder")
folder_path = f"{self.result_dir}"
os.makedirs(folder_path, exist_ok=True)
if not os.path.exists(folder_path):
self.logger.error(f"folder {folder_path} create fail")
raise RuntimeError(f"result folder {folder_path} create fail")
if self.dataset_name not in question_num:
self.logger.error(self.dataset_name + " not supported")
raise RuntimeError(f"{self.dataset_name} not supported")
self.eos_token = [-1 for _ in range(question_num[self.dataset_name])]
self.logger.info("---------------------" + self.dataset_name + " Batch " + str(
self.batch_size) + " Tokens Result Compare Begins------------------------")
self.__compare_results_helper("tokens")
self.logger.info("---------------------" + self.dataset_name + " Batch " + str(
self.batch_size) + " Tokens Result Compare Ends------------------------")
self.logger.info("---------------------" + self.dataset_name + " Batch " + str(
self.batch_size) + " Logits Result Compare Begins------------------------")
self.__compare_results_helper("logits")
self.logger.info("---------------------" + self.dataset_name + " Batch " + str(
self.batch_size) + " Logits Result Compare Ends------------------------")
def __compare_results_helper(self, type):
error_1e4 = 0
error_1e3 = 0
total_tokens_checked = 0
total_logits_checked = 0
greatest_kll = 0
for epoch_id in range(math.ceil(question_num[self.dataset_name] / self.batch_size)):
cnt = 0
while True:
golden_path = f"{self.data_dir}/GPU/{self.dataset_name}/batch{self.batch_size}/{epoch_id}/{type}_{cnt}.pth"
npu_path = f"{self.data_dir}/NPU/{self.dataset_name}/batch{self.batch_size}/{epoch_id}/{type}_{cnt}.pth"
golden_file_exists = os.path.exists(golden_path)
npu_file_exists = os.path.exists(npu_path)
if not golden_file_exists and not npu_file_exists:
self.result_logger.debug(self.dataset_name + " batch " + str(self.batch_size) + " epoch " + str(
epoch_id) + " " + type + " compare finish, total " + str(cnt) + " " + type)
break
elif golden_file_exists and npu_file_exists:
golden_results = torch.load(golden_path).cpu()
npu_results = torch.load(npu_path).cpu()
if type == "tokens":
for i in range(len(golden_results)):
total_tokens_checked += 1
if self.eos_token[self.batch_size * epoch_id + i] == -1 and (
npu_results[i] != golden_results[i] or npu_results[
i] == self.tokenizer.eos_token_id):
self.eos_token[self.batch_size * epoch_id + i] = cnt
self.result_logger.debug(
self.dataset_name + " batch " + str(self.batch_size) + " epoch " + str(
epoch_id) + " question " + str(self.batch_size * epoch_id + i) +
" token No." + str(
cnt) + " is the first different token or eos token, ignore checking the rest.\ngolden tokenId: " + str(
golden_results[i]) + ", npu tokenId: " + str(npu_results[i]))
elif type == "logits":
split_golden_results = torch.split(golden_results, 1, dim=0)
split_npu_results = torch.split(npu_results, 1, dim=0)
for i in range(len(split_golden_results)):
eos_token = self.eos_token[self.batch_size * epoch_id + i]
if eos_token != -1 and cnt > eos_token:
continue
total_logits_checked += 1
golden_results_logsoftmax = torch.log_softmax(split_golden_results[i].float(), dim=-1)
npu_results_logsoftmax = torch.log_softmax(split_npu_results[i].float(), dim=-1)
kl_loss = torch.nn.KLDivLoss(log_target=True, reduction='sum')
output = kl_loss(npu_results_logsoftmax, golden_results_logsoftmax)
greatest_kll = max(greatest_kll, output.item())
if (output > 0.0001):
if (output > 0.001):
error_1e3 += 1
error_1e4 += 1
self.result_logger.debug(
"--------------------------------" + type + " Error Begins--------------------------------")
self.result_logger.debug(
self.dataset_name + " batch" + str(self.batch_size) + " epoch " + str(
epoch_id) + " question " + str(self.batch_size * epoch_id + i) +
" logits No." + str(cnt) + " fail, KL loss is: {:.6f}".format(output.item()))
golden_logits_sorted = torch.sort(split_golden_results[i], descending=True)
npu_logits_sorted = torch.sort(split_npu_results[i], descending=True)
self.result_logger.debug(
"golden logits: \n" + str(golden_logits_sorted[0]) + "\nnpu logits: \n" + str(
npu_logits_sorted[0]))
self.result_logger.debug(
"golden index: \n" + str(golden_logits_sorted[1]) + "\nnpu index: \n" + str(
npu_logits_sorted[1]))
self.result_logger.debug(
"--------------------------------" + type + " Error Ends--------------------------------")
cnt += 1
else:
self.result_logger.debug(self.dataset_name + " batch " + str(self.batch_size) + " epoch " + str(
epoch_id) + " " + type + " size not equal")
self.result_logger.debug(self.dataset_name + " batch " + str(self.batch_size) + " epoch " + str(
epoch_id) + " " + type + " compare finish, total " + str(cnt) + " " + type)
break
if type == "tokens":
self.result_logger.debug(
self.dataset_name + " batch " + str(self.batch_size) + " finished check, total tokens num " + str(
total_tokens_checked) + ", find " +
str(len(self.eos_token) - self.eos_token.count(-1)) + " question responses have " + type + " mismatch")
elif type == "logits":
pass_rate = error_1e4 / total_logits_checked
pass_result = "Pass"
if pass_rate > 0.005 or error_1e3 > 0:
pass_result = "Fail"
self.result_logger.debug(
self.dataset_name + " batch " + str(self.batch_size) + " finished check, total logits checked " + str(
total_logits_checked) + ", " + str(error_1e4) +
" 1e-4 " + type + " errors found, " + str(
error_1e3) + " 1e-3 " + type + " errors found, 1e-4 error rate " + str(pass_rate))
csv_result = [str(self.model_name).ljust(15), str(self.dataset_name).ljust(15),
str(self.batch_size).ljust(15), str(total_logits_checked).ljust(15),
str(round(greatest_kll, 10)).ljust(15), str(round(pass_rate, 10)).ljust(15),
str(pass_result).ljust(15)]
csv_simplified_path = os.path.join(self.script_path, "../result", "simplified_test_result.csv")
if not os.path.exists(csv_simplified_path):
self.logger.warning("simplified dataset result csv file does not exist, cannot record results")
raise RuntimeError("csv result file does not exist")
with open(csv_simplified_path, 'a', newline='') as csv_simplified_file:
csv_writer = csv.writer(csv_simplified_file, delimiter='|')
csv_writer.writerow(csv_result)
self.logger.info(self.model_name + " " + self.dataset_name + " batch" + str(
self.batch_size) + " result saved in result/simplified_test_result.csv")
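# The logits comparison in __compare_results_helper combines a per-position KL divergence with a two-tier
# threshold rule. The sketch below reproduces that logic in pure Python so it can be checked without torch.
# The names `log_softmax`, `kl_divergence` and `classify` are hypothetical helpers. The thresholds
# (1e-4, 1e-3, 0.005) are the ones used in the method.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_divergence(golden_logits, npu_logits):
    """KL(golden || npu), summed over the vocabulary.

    Matches KLDivLoss(log_target=True, reduction='sum') when the NPU
    log-probabilities are the input and the golden ones are the target.
    """
    golden = log_softmax(golden_logits)
    npu = log_softmax(npu_logits)
    return sum(math.exp(g) * (g - n) for g, n in zip(golden, npu))


def classify(kl_values, rate_limit=0.005):
    """Fail if any KL exceeds 1e-3 or if too many exceed 1e-4."""
    if not kl_values:
        raise ValueError("no logits were checked")
    error_1e4 = sum(kl > 1e-4 for kl in kl_values)
    error_1e3 = sum(kl > 1e-3 for kl in kl_values)
    error_rate = error_1e4 / len(kl_values)
    return "Fail" if error_rate > rate_limit or error_1e3 > 0 else "Pass"


if __name__ == "__main__":
    kl = kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.01])
    print(kl, classify([kl]))
```

# Raising on an empty list is a choice made for this sketch; the method itself divides by total_logits_checked without a guard.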
def __compare_full_dataset_results(self):
golden_name = '_'.join([self.model_name, self.dataset_name])
golden_path = ''
for file_name in os.listdir(f"{self.data_dir}/GPU/{self.dataset_name}/batch{self.batch_size}"):
if file_name.startswith(golden_name):
golden_path = os.path.join(f"{self.data_dir}/GPU/{self.dataset_name}/batch{self.batch_size}", file_name)
break
if not os.path.exists(self.current_result_path):
raise RuntimeError(
"NPU test data does not exist, an error occurred in the test")
if not os.path.exists(golden_path):
raise RuntimeError(
"GPU golden data does not exist, upload it to the data dir folder")
result_df = pd.read_csv(self.current_result_path, sep='|', skipinitialspace=True).rename(
columns=lambda x: x.strip())
result_df = result_df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
golden_df = pd.read_csv(golden_path, sep='|', skipinitialspace=True).rename(columns=lambda x: x.strip())
golden_df = golden_df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
csv_result = []
if self.dataset_name in ('MMLU', 'CEval', 'GSM8K'):
result_total = result_df.loc[result_df['file_name'] == 'total', 'value'].values[0]
golden_total = golden_df.loc[golden_df['file_name'] == 'total', 'value'].values[0]
diff_val = golden_total - result_total
pass_result = "Pass"
if diff_val <= 0.1:
self.result_logger.debug(
f"{self.current_result_path} is pass({diff_val}%), golden:{golden_total}, test:{result_total}")
else:
pass_result = "Fail"
self.result_logger.debug(
f"{self.current_result_path} is failed({diff_val}%), golden:{golden_total}, test:{result_total}")
csv_result = [str(self.model_name).ljust(15), str(self.dataset_name).ljust(15),
str(self.batch_size).ljust(15), str(round(golden_total, 10)).ljust(15),
str(round(result_total, 10)).ljust(15), str(pass_result).ljust(15)]
elif self.dataset_name == 'TruthfulQA':
if len(result_df) != len(golden_df):
raise RuntimeError(f"result_df len:{len(result_df)}, golden_df len:{len(golden_df)}")
result_MC1_sum = 0
result_MC2_sum = 0
golden_MC1_sum = 0
golden_MC2_sum = 0
pass_result = "Pass"
for index, result_row in result_df.iterrows():
golden_row = golden_df.iloc[index]
result_MC1_sum += result_row['MC1']
result_MC2_sum += result_row['MC2']
golden_MC1_sum += golden_row['MC1']
golden_MC2_sum += golden_row['MC2']
diff_MC1 = (golden_MC1_sum - result_MC1_sum) / len(result_df)
diff_MC2 = (golden_MC2_sum - result_MC2_sum) / len(result_df)
if ((diff_MC1 <= 0.1) and (diff_MC2 <= 0.1)):
self.result_logger.debug(
f"{self.current_result_path} is pass(MC1:{diff_MC1} MC2:{diff_MC2}), golden:{golden_MC2_sum / len(result_df)} , test:{result_MC2_sum / len(result_df)}")
else:
pass_result = "Fail"
self.result_logger.debug(
f"{self.current_result_path} is failed(MC1:{diff_MC1} MC2:{diff_MC2}), golden:{golden_MC2_sum / len(result_df)}, test:{result_MC2_sum / len(result_df)}")
csv_result = [str(self.model_name).ljust(15), str(self.dataset_name).ljust(15),
str(self.batch_size).ljust(15), str(round((golden_MC2_sum / len(result_df)), 10)).ljust(15),
str(round((result_MC2_sum / len(result_df)), 10)).ljust(15), str(pass_result).ljust(15)]
csv_full_path = os.path.join(self.script_path, "../result", "full_test_result.csv")
if not os.path.exists(csv_full_path):
self.logger.error("full dataset result csv file does not exist, cannot record results")
raise RuntimeError(f"csv result file does not exist: {csv_full_path}")
with open(csv_full_path, 'a', newline='') as csv_full_file:
csv_writer = csv.writer(csv_full_file, delimiter='|')
csv_writer.writerow(csv_result)
self.logger.info(self.model_name + " " + self.dataset_name + " batch" + str(
self.batch_size) + " result saved in result/full_test_result.csv")
def __get_rank(self):
if self.hardware_type == "GPU":
return torch.cuda.current_device()
else:
return self.pa_runner.rank
def __get_device_type(self):
if self.hardware_type == "NPU":
self.soc_version = torch_npu._C._npu_get_soc_version()
if self.soc_version in (100, 101, 102, 200, 201, 202, 203):
self.is_format_nz = True
return soc_version_map.get(self.soc_version)
elif self.hardware_type == "GPU":
return "GPU"
def __patch_hf_transformers_utils(self):
transformers_path = transformers.__path__[0]
transformers_utils_path = f"{transformers_path}/generation/utils.py"
with open(transformers_utils_path, "r") as utils_file:
utils_content = utils_file.readlines()
try:
utils_content.index(UTILS_CODE_INSERTED_MARKER)
except ValueError:
try:
insert_position = utils_content.index(UTILS_CODE_MARKER)
except ValueError:
self.logger.error("UTILS_CODE_MARKER not found in the transformers utils.py file.")
raise RuntimeError("UTILS_CODE_MARKER not found in the transformers utils.py file.")
# Back up only while the file is still unpatched, so a patched copy never overwrites the backup.
shutil.copy(transformers_utils_path, f"{transformers_path}/generation/utils_backup.py")
utils_content.insert(insert_position + 234, UTILS_CODE_INSERTED_PART_4)
utils_content.insert(insert_position + 203, UTILS_CODE_INSERTED_PART_3)
utils_content.insert(insert_position + 154, UTILS_CODE_INSERTED_PART_2)
utils_content.insert(insert_position + 153, UTILS_CODE_INSERTED_PART_1)
with open(transformers_utils_path, "w") as utils_file:
utils_file.writelines(utils_content)
self.logger.info("transformers utils.py update success")
return
self.logger.warning("transformers utils.py was not updated (inserted marker already present). Please confirm it behaves as you expect")
def __setup_model_parallel(self):
if self.hardware_type in communication_map:
torch.distributed.init_process_group(communication_map[self.hardware_type])
else:
self.logger.error("unsupported hardware type")
raise RuntimeError("unsupported hardware type")
self.logger.info(f"{communication_map[self.hardware_type]} distributed process init success.")
if self.hardware_type == "NPU":
self.logger.info(f"use npu:{self.rank}")
torch_npu.npu.set_device(torch.device(f"npu:{self.rank}"))
elif self.hardware_type == "GPU":
self.logger.info(f"use gpu:{self.rank}")
torch.cuda.set_device(self.rank)
self.logger.info("Device Set Success!")
def get_fa_tokenizer(self, **kwargs):
return AutoTokenizer.from_pretrained(self.weight_dir, **kwargs)
def __npu_adapt(self):
if self.is_format_nz:
for name, module in self.model.named_modules():
if isinstance(module, torch.nn.Linear):
if name == 'lm_head':
module.weight.data = torch.nn.parameter.Parameter(module.weight.data)
module.weight.data = torch_npu.npu_format_cast(module.weight.data, 29)
self.logger.info(f"current soc: {self.soc_version}({self.device_type}), cast NZ")
else:
self.logger.info(f"current soc: {self.soc_version}({self.device_type}), not cast NZ")
def __save_result(self, result):
def align_columns(df):
max_widths = df.applymap(lambda x: len(str(x))).max()
for col in df.columns:
df[col] = df[col].apply(lambda x: str(x).ljust(max_widths[col]))
return df
def align_headers(df):
max_widths = [max(len(str(col)), df[col].map(lambda x: len(str(x))).max()) for col in df.columns]
headers = [col.ljust(max_widths[i]) for i, col in enumerate(df.columns)]
df.columns = headers
for i, row in enumerate(df.values):
df.iloc[i] = [str(val).ljust(max_widths[j]) for j, val in enumerate(row)]
return df
now = datetime.now()
date_str = now.strftime("%Y_%m_%d_%H_%M_%S")
if self.quantize:
result_name = "_".join([self.model_type, self.data_type, self.quantize, "batch" + str(self.batch_size), self.test_mode, self.dataset_name]) + '_test_result'
else:
result_name = "_".join([self.model_type, self.data_type, "batch" + str(self.batch_size), self.test_mode, self.dataset_name]) + '_test_result'
if self.dataset_name == "HumanEval":
result_name += ".jsonl"
result_path = os.path.join(self.data_dir, self.hardware_type, self.dataset_name, f"batch{self.batch_size}",
result_name)
with open(result_path, 'wb') as fp:
for x in result:
fp.write((json.dumps(x) + "\n").encode('utf-8'))
else:
result_name += ".csv"
result_path = os.path.join(self.data_dir, self.hardware_type, self.dataset_name, f"batch{self.batch_size}", result_name)
if self.dataset_name == "TruthfulQA":
df = pd.DataFrame(result, columns=['idx', 'MC1', 'MC2', 'MC3'])
else:
df = pd.DataFrame(result, columns=['file_name', 'value', 'correct', 'sum'])
df = align_columns(df)
df = align_headers(df)
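# Write with the same '|' separator that __compare_full_dataset_results uses when reading this file back.
df.to_csv(result_path, sep='|', index=False)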
self.logger.info(f"{self.dataset_name} result saved to: {result_path}")
self.current_result_path = result_path
def __get_log(self, type):
if type == "log":
folder_path = self.log_dir
elif type == "result":
folder_path = self.result_dir
os.makedirs(folder_path, exist_ok=True)
if not os.path.exists(folder_path):
raise RuntimeError(f"{type} folder {folder_path} create fail")
cst_timezone = timezone(timedelta(hours=8))
current_time = datetime.now(cst_timezone)
formatted_datetime = current_time.strftime("%Y_%m_%d_%H_%M_%S")
formatter = logging.Formatter('%(asctime)s - [%(levelname)s] - %(filename)s:%(lineno)d - %(message)s')
streamer_handler = logging.StreamHandler()
streamer_handler.setFormatter(formatter)
file_handler = logging.FileHandler(os.path.join(folder_path, self.model_name + "_" + self.model_type + "_" +
self.data_type + "_" + self.dataset_name + "_batch" +
str(self.batch_size) + "_" + formatted_datetime + ".log"))
file_handler.setFormatter(formatter)
logger = logging.getLogger(type)
if type == "log":
logger.setLevel(logging.INFO)
file_handler.setLevel(logging.INFO)
streamer_handler.setLevel(logging.INFO)
elif type == "result":
logger.setLevel(logging.DEBUG)
file_handler.setLevel(logging.DEBUG)
streamer_handler.setLevel(logging.DEBUG)
logger.addHandler(streamer_handler)
logger.addHandler(file_handler)
logger.propagate = False
return logger
def parse_args():
parser = argparse.ArgumentParser(description="Model precision test arguments")
parser.add_argument(
"--model_type",
type=str,
default='pa',
choices=['fa', 'pa'],
help="Specify which model type to test"
)
parser.add_argument(
"--data_type",
type=str,
default='fp16',
choices=['fp16', 'bf16'],
help="Specify which data type to test"
)
parser.add_argument(
"--test_mode",
type=str,
default='performance',
choices=['simplified', 'full', 'performance'],
help="Specify the mode in which to run the test"
)
parser.add_argument("--model_name", type=str, required=True, help="name of model")
parser.add_argument("--weight_dir", type=str, required=True, help="path to model weight folder")
parser.add_argument("--data_dir", type=str, help="path to save the tensor")
parser.add_argument("--dataset_name", type=str, default="GSM8K", help="which dataset to run")
parser.add_argument("--batch_size", type=int, default=1, help="batch size")
parser.add_argument("--device_id", type=int, default=7, help="device id")
parser.add_argument("--result_dir", type=str, help="path to save results")
parser.add_argument("--log_dir", type=str, help="path to save logs")
parser.add_argument("--hardware_type", type=str, default="NPU", help="current device type, GPU or NPU")
parser.add_argument("--case_pair", type=str, default="[[256, 256], [512, 512], [1024, 1024], [2048, 2048]]",
help="performance test pair")
parser.add_argument("--use_refactor", type=str, default="True", help="specify whether llama model use refactor")
parser.add_argument("--max_position_embeddings", type=int, help="specify max_position_embeddings (or max_seq_len) for the model")
return parser.parse_args()
def get_args():
args = parse_args()
base_path = ATB_TESTDATA_PATH
test_type = "performance" if args.test_mode == "performance" else "precision"
if ATB_TESTDATA_PATH is None:
base_path = os.path.join(os.path.dirname(__file__), "../")
if args.data_dir is None:
data_dir = os.path.join(base_path, f"{test_type}_test", args.test_mode)
else:
data_dir = args.data_dir
if args.result_dir is None:
result_dir = os.path.join(base_path, f"{test_type}_test", args.test_mode)
else:
result_dir = args.result_dir
if args.log_dir is None:
log_dir = os.path.join(base_path, f"{test_type}_test", args.test_mode)
else:
log_dir = args.log_dir
case_pair = args.case_pair
if args.case_pair == "[]":
case_pair = "[[256, 256], [512, 512], [1024, 1024], [2048, 2048]]"
return [args.model_type, args.data_type, args.test_mode, args.model_name, data_dir, args.dataset_name,
args.batch_size, args.device_id, result_dir, log_dir, args.hardware_type, case_pair, args.weight_dir,
str(args.use_refactor).lower() == "true", args.max_position_embeddings]
================================================
FILE: llm-localization/ascend/mindie/script/run.sh
================================================
#!/bin/bash
# Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# shellcheck disable=SC2148
SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)
TESTS_DIR=$(cd "$SCRIPT_DIR/core/" && pwd)
test_mode="performance"
model_type="pa"
model_name=""
weight_dir=""
data_type="fp16"
hardware_type="NPU"
chip_num=0
dataset="CEval"
batch_size=0
case_pair="[]"
use_refactor="True"
max_position_embedding=-1
function fn_prepare()
{
if [ "$hardware_type" == "NPU" ]; then
if [ -z "$ASCEND_HOME_PATH" ];then
echo "env ASCEND_HOME_PATH is not set, fail"
exit 1
fi
if [ -z "$ATB_HOME_PATH" ];then
echo "env ATB_HOME_PATH is not set, fail"
exit 1
fi
fi
export INT8_FORMAT_NZ_ENABLE=1
export PYTHONPATH="${PYTHONPATH}:$(dirname "$(readlink -f "$0")")"
export PYTHONPATH="${PYTHONPATH}:$(dirname "$(dirname "$(dirname "$(readlink -f "$0")")")")"
IFS="_"
read -ra parts <<< "$1"
model_type="${parts[0]}"
if [ "$model_type" == "pa" ]; then
data_type="${parts[1]}"
fi
test_mode="$2"
if ! [ "$test_mode" == "performance" ]; then
read -ra parts <<< "$2"
test_mode="${parts[0]}"
dataset="${parts[1]}"
fi
if [ "$test_mode" == "performance" ]; then
export ATB_LLM_BENCHMARK_ENABLE=1
export ATB_LLM_BENCHMARK_FILEPATH="${SCRIPT_DIR}/benchmark.csv"
fi
}
function fn_run_single()
{
test_file="${model_name}_test.py"
test_path="${TESTS_DIR}/${test_file}"
if [[ ! -e "$test_path" ]];then
echo "model test file $test_path is not found."
exit 1
fi
if [ "$chip_num" == 0 ]; then
code_line=$(grep -A 1 "def get_chip_num(self):" "${test_path}" | tail -n 1)
if [ -z "$code_line" ]; then
echo "Warning: get_chip_num() not overridden in '$test_file', use chip_num 1"
chip_num=1
else
chip_num=$(echo "$code_line" | awk -F 'return ' '{print $2}')
if ! [[ "$chip_num" =~ ^[1-9][0-9]*$ ]]; then
echo "Error: return value of get_chip_num() in '$test_file' is not a digit."
exit 1
fi
fi
fi
if [ "$hardware_type" == "NPU" ]; then
if ! [ -n "$ASCEND_RT_VISIBLE_DEVICES" ]; then
devices=""
for ((i=0; i /dev/null; then
hardware_type="GPU"
echo "INFO: Detected NVIDIA GPU"
else
if command -v npu-smi info &> /dev/null; then
echo "INFO: Detected Ascend NPU"
else
echo "Error: No GPU or NPU detected"
exit 1
fi
fi
if [ $# -eq 0 ]; then
echo "Error: require parameter. Please refer to README."
exit 1
fi
model_type=$1
case "$model_type" in
fa|pa_fp16|pa_bf16)
echo "INFO: current model_type: $model_type"
;;
*)
echo "ERROR: invalid model_type, only support fa, pa_fp16, pa_bf16"
exit 1
;;
esac
test_modes=$2
case "$test_modes" in
performance|simplified_GSM8K|simplified_TruthfulQA|full_CEval|full_GSM8K|full_MMLU|full_TruthfulQA|full_BoolQ|full_HumanEval)
echo "INFO: current test_mode: $test_modes"
;;
*)
echo "ERROR: invalid test_mode, only support performance, simplified_GSM8K, simplified_TruthfulQA, \
full_CEval, full_GSM8K, full_MMLU, full_TruthfulQA, full_BoolQ, full_HumanEval"
exit 1
;;
esac
if [ "$test_modes" == "performance" ]; then
case_pair=$3
shift
fi
batch_size=$3
model_name=$4
if [ "$model_name" == "llama" ]; then
use_refactor=$5
shift
fi
weight_dir=$5
echo "INFO: current batch_size: $batch_size"
echo "INFO: current model_name: $model_name"
echo "INFO: current weight_dir: $weight_dir"
fn_prepare "$model_type" "$test_modes"
if ! [[ "$6" =~ ^[1-9][0-9]*$ ]]; then
echo "Error: input chip_num is not a digit."
exit 1
fi
chip_num=$6
echo "INFO: use input chip_num $chip_num"
if [ $# -ge 7 ]; then
if ! [[ "$7" =~ ^[0-9]+$ ]]; then
echo "Error: input max_position_embedding or max_seq_len is not a digit."
exit 1
fi
max_position_embedding=$7
echo "INFO: use input max_position_embedding or max_seq_len $max_position_embedding"
fi
fn_run_single
}
fn_main "$@"
================================================
FILE: llm-localization/ascend/mindie/性能调优.md
================================================
910b4 llama-7b 10g KV CACHE
Total Block Num = 160
Block Num = Ceil(input token count / Block Size) + Ceil(max output token count / Block Size)
Ceil(560/128) + Ceil(512/128) = 5 + 4 = 9
batch_size: 20
910B3 llama-7b 30g KV CACHE
Total Block Num = 480
Ceil(560/128) + Ceil(512/128) = 5 + 4 = 9
batch_size: 50
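
The block arithmetic in these notes can be sketched in Python. This is a minimal sketch, assuming a Block Size of 128 tokens (the value for which 560 input tokens and 512 output tokens need 9 blocks). The function names and the floor-division estimate of how many requests fit at once are illustrative assumptions, not MindIE APIs.

```python
import math


def blocks_per_request(input_tokens: int, max_output_tokens: int, block_size: int) -> int:
    """Block Num = Ceil(input tokens / Block Size) + Ceil(max output tokens / Block Size)."""
    return math.ceil(input_tokens / block_size) + math.ceil(max_output_tokens / block_size)


def max_concurrent_requests(total_block_num: int, block_num: int) -> int:
    """Rough upper bound on requests whose KV cache blocks fit at the same time (assumption)."""
    return total_block_num // block_num


BLOCK_SIZE = 128  # assumption: inferred from the 9-block result in the notes

block_num = blocks_per_request(560, 512, BLOCK_SIZE)
print(block_num)                                # 9
print(max_concurrent_requests(160, block_num))  # 17 (10g KV cache case)
print(max_concurrent_requests(480, block_num))  # 53 (30g KV cache case)
```

The estimates of 17 and 53 are of the same order as the batch_size values of 20 and 50 recorded in the notes; the recorded values are the settings actually used and are not derived from this sketch.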
================================================
FILE: llm-localization/ascend/mindie/日志分析.txt
================================================
tail -100f mindservice.log | grep "COMPLETED REQ ID"
2024-07-24 16:25:04.777655 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1236, 3 , 272 , 16 , 256 , 20 , 30 , 1
2024-07-24 16:25:05.234118 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1239, 3 , 271 , 15 , 256 , 20 , 29 , 2
2024-07-24 16:25:05.360007 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1262, 1 , 99 , 22 , 77 , 20 , 26 , 1
2024-07-24 16:25:05.571847 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1244, 2 , 234 , 22 , 212 , 20 , 26 , 1
2024-07-24 16:25:05.705152 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1241, 3 , 281 , 25 , 256 , 20 , 25 , 1
2024-07-24 16:25:06.538975 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1258, 2 , 145 , 16 , 129 , 20 , 27 , 2
2024-07-24 16:25:06.901611 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1274, 1 , 41 , 15 , 26 , 20 , 27 , 1
2024-07-24 16:25:07.724699 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1253, 2 , 195 , 13 , 182 , 20 , 29 , 1
2024-07-24 16:25:07.940994 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1277, 1 , 45 , 17 , 28 , 20 , 29 , 1
2024-07-24 16:25:08.764214 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1257, 2 , 201 , 17 , 184 , 20 , 31 , 1
2024-07-24 16:25:08.973185 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1280, 1 , 40 , 19 , 21 , 20 , 31 , 1
2024-07-24 16:25:10.494941 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1282, 1 , 56 , 25 , 31 , 20 , 33 , 1
2024-07-24 16:25:10.541398 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1250, 3 , 269 , 13 , 256 , 19 , 32 , 1
2024-07-24 16:25:13.150968 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1284, 1 , 82 , 28 , 54 , 20 , 40 , 1
2024-07-24 16:25:13.282448 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1259, 3 , 273 , 17 , 256 , 20 , 41 , 2
2024-07-24 16:25:13.913430 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1261, 3 , 273 , 17 , 256 , 20 , 38 , 1
2024-07-24 16:25:14.495745 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1263, 3 , 279 , 23 , 256 , 20 , 40 , 1
2024-07-24 16:25:14.717521 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1264, 3 , 268 , 12 , 256 , 20 , 38 , 2
2024-07-24 16:25:15.027415 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1266, 3 , 268 , 12 , 256 , 20 , 35 , 1
2024-07-24 16:25:15.521481 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1287, 1 , 61 , 16 , 45 , 20 , 34 , 1
2024-07-24 16:25:15.567090 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1267, 3 , 273 , 17 , 256 , 19 , 33 , 1
2024-07-24 16:25:16.039858 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1268, 3 , 272 , 16 , 256 , 20 , 33 , 1
2024-07-24 16:25:16.432710 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1269, 3 , 329 , 73 , 256 , 20 , 31 , 1
2024-07-24 16:25:17.082790 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1270, 3 , 263 , 16 , 247 , 20 , 30 , 1
2024-07-24 16:25:17.339481 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1289, 1 , 72 , 15 , 57 , 20 , 30 , 1
2024-07-24 16:25:17.993777 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1271, 3 , 270 , 14 , 256 , 20 , 31 , 1
2024-07-24 16:25:18.121696 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1273, 3 , 271 , 15 , 256 , 20 , 29 , 1
2024-07-24 16:25:18.248203 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1286, 1 , 116 , 17 , 99 , 20 , 27 , 1
2024-07-24 16:25:18.458886 1360 info ibis_request.cc:240] COMPLETED REQ ID: 1275, 3 , 280 , 24 , 256 , 20 , 27 , 1
================================================
FILE: llm-localization/ascend/mindspore/MindSpore-note.md
================================================
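Graph mode: also called static graph mode. The neural network model is compiled into a single whole graph, which is then dispatched for execution. This mode uses graph optimization and related techniques to improve runtime performance, and it also helps with large-scale deployment and cross-platform execution.
PyNative mode: dynamic graph mode. The operators in the neural network are dispatched and executed one by one, which makes it easier to write and debug neural network models.
The main differences between the two are as follows.
Use cases: Graph mode requires the network structure to be built up front, after which the framework optimizes and executes the whole graph. It suits scenarios where the network is fixed and high performance is required. PyNative mode executes operators line by line and supports computing gradients individually.
Network execution: When executing the same network and operators, Graph mode and PyNative mode produce consistent accuracy. Because Graph mode applies graph optimization, whole-graph sinking, and similar techniques, it executes networks with higher performance and efficiency.
Code debugging: PyNative mode is recommended for script development and for debugging the network flow. In PyNative mode you can easily set breakpoints, obtain intermediate results of network execution, and debug the network with pdb. Graph mode does not support breakpoints; you can only designate operators to print and then inspect the output after the network has finished executing.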
================================================
FILE: llm-localization/ascend/mindspore/README.md
================================================
- https://gitee.com/mindspore/mindspore
```
import numpy as np
import mindspore.context as context
from mindspore import Tensor
from mindspore.ops import functional as F
context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU")
x = Tensor(np.ones([1,3,3,4]).astype(np.float32))
y = Tensor(np.ones([1,3,3,4]).astype(np.float32))
print(F.tensor_add(x, y))
```
https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.3.0rc1/MindSpore/unified/aarch64/mindspore-2.3.0rc1-cp39-cp39-linux_aarch64.whl
================================================
FILE: llm-localization/ascend/mindspore/bert.md
================================================
## bert
```
pip install wikiextractor
python -m wikiextractor.WikiExtractor -o