Repository: OpenGVLab/VideoChat-Flash
Branch: main
Commit: 2f8e2f578897
Files: 1928
Total size: 31.2 MB

Directory structure:
gitextract_mlwsex56/
├── .gitattributes
├── LICENSE
├── README.md
├── llava-train_videochat/
│ ├── .dockerignore
│ ├── .editorconfig
│ ├── .gitattributes
│ ├── .gitignore
│ ├── LICENSE
│ ├── README.md
│ ├── cog.yaml
│ ├── data/
│ │ ├── ablation_short-long_mix_sft.yaml
│ │ ├── stage1_init_connector_iv1m.yaml
│ │ ├── stage2_short_pretrain_iv6m.yaml
│ │ ├── stage3_short-long_mix_sft.yaml
│ │ └── stage4_highres_postsft.yaml
│ ├── llava/
│ │ ├── __init__.py
│ │ ├── constants.py
│ │ ├── conversation.py
│ │ ├── dist_utils.py
│ │ ├── mm_utils.py
│ │ ├── model/
│ │ │ ├── __init__.py
│ │ │ ├── apply_delta.py
│ │ │ ├── builder.py
│ │ │ ├── consolidate.py
│ │ │ ├── language_model/
│ │ │ │ ├── llava_qwen.py
│ │ │ │ ├── llava_qwen_flash.py
│ │ │ │ └── modeling_qwen2_flash.py
│ │ │ ├── llava_arch.py
│ │ │ ├── make_delta.py
│ │ │ ├── multimodal_encoder/
│ │ │ │ ├── builder.py
│ │ │ │ ├── clip_encoder.py
│ │ │ │ ├── internvideo2/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── flash_attention_class.py
│ │ │ │ │ ├── pos_embed.py
│ │ │ │ │ └── vit_scale_clean.py
│ │ │ │ ├── internvideo2_encoder.py
│ │ │ │ ├── siglip_encoder.py
│ │ │ │ ├── umt/
│ │ │ │ │ └── vit.py
│ │ │ │ └── umt_encoder.py
│ │ │ ├── multimodal_projector/
│ │ │ │ ├── builder.py
│ │ │ │ └── tome16_mlp_hd64.py
│ │ │ └── utils.py
│ │ ├── serialize_utils.py
│ │ ├── train/
│ │ │ ├── llava_trainer.py
│ │ │ ├── llava_trainer_eval.py
│ │ │ ├── train.py
│ │ │ └── train_mem.py
│ │ ├── utils.py
│ │ └── video_utils.py
│ ├── pyproject.toml
│ ├── requirements.txt
│ └── scripts/
│   ├── train/
│   │ ├── stage1-init_connector/
│   │ │ ├── stage1_internvideo2_tome16_res224_qwen7b.sh
│   │ │ ├── stage1_umt_tome16_res224_qwen7b.sh
│   │ │ └── stage1_umt_tome16_res448_qwen1_5b.sh
│   │ ├── stage2-visual_pretraining/
│   │ │ ├── stage2_internvideo2_tome16_res224_qwen_7b.sh
│   │ │ ├── stage2_umt_tome16_res224_qwen_7b.sh
│   │ │ └── stage2_umt_tome16_res448_qwen_1_5b.sh
│   │ ├── stage3-video_sft/
│   │ │ ├── stage3_internvideo2_tome16_res224_qwen_7b.sh
│   │ │ ├── stage3_umt_tome16_res224_qwen_7b.sh
│   │ │ └── stage3_umt_tome16_res448_qwen_1_5b.sh
│   │ └── stage4_highres_postft/
│   │   └── stage4_umt_tome16_res448_qwen_7b.sh
│   ├── zero1.json
│   ├── zero2.json
│   ├── zero2_fused_adamw.json
│   ├── zero2_offload.json
│   ├── zero3.json
│   ├── zero3_offload.json
│   └── zero3pp.json
├── lmms-eval_videochat/
│ ├── .gitignore
│ ├── .pre-commit-config.yaml
│ ├── LICENSE
│ ├── README.md
│ ├── docs/
│ │ ├── README.md
│ │ ├── commands.md
│ │ ├── current_tasks.md
│ │ ├── model_guide.md
│ │ ├── run_examples.md
│ │ └── task_guide.md
│ ├── eval_annotations/
│ │ ├── LVBench/
│ │ │ ├── README.md
│ │ │ └── json/
│ │ │   ├── lvbench_clean.json
│ │ │   ├── lvbench_clean_cartoon.json
│ │ │   ├── lvbench_clean_documentary.json
│ │ │   ├── lvbench_clean_live.json
│ │ │   ├── lvbench_clean_selfmedia.json
│ │ │   ├── lvbench_clean_sport.json
│ │ │   └── lvbench_clean_tv.json
│ │ ├── LongVideoBench/
│ │ │ ├── README.md
│ │ │ ├── lvb_test_wo_gt.json
│ │ │ ├── lvb_val.json
│ │ │ ├── test-00000-of-00001.parquet
│ │ │ └── validation-00000-of-00001.parquet
│ │ ├── MLVU_MC/
│ │ │ ├── README.md
│ │ │ └── json/
│ │ │   ├── 1_plotQA.json
│ │ │   ├── 2_needle.json
│ │ │   ├── 3_ego.json
│ │ │   ├── 4_count.json
│ │ │   ├── 5_order.json
│ │ │   ├── 6_anomaly_reco.json
│ │ │   └── 7_topic_reasoning.json
│ │ ├── MVBench/
│ │ │ ├── README.md
│ │ │ └── json/
│ │ │   ├── action_antonym.json
│ │ │   ├── action_count.json
│ │ │   ├── action_localization.json
│ │ │   ├── action_prediction.json
│ │ │   ├── action_sequence.json
│ │ │   ├── character_order.json
│ │ │   ├── counterfactual_inference.json
│ │ │   ├── egocentric_navigation.json
│ │ │   ├── episodic_reasoning.json
│ │ │   ├── fine_grained_action.json
│ │ │   ├── fine_grained_pose.json
│ │ │   ├── moving_attribute.json
│ │ │   ├── moving_count.json
│ │ │   ├── moving_direction.json
│ │ │   ├── object_existence.json
│ │ │   ├── object_interaction.json
│ │ │   ├── object_shuffle.json
│ │ │   ├── scene_transition.json
│ │ │   ├── state_change.json
│ │ │   └── unexpected_action.json
│ │ ├── PerceptionTest/
│ │ │ ├── .gitattributes
│ │ │ └── README.md
│ │ ├── Temporal_Grounding/
│ │ │ ├── README.md
│ │ │ └── json/
│ │ │   └── temporal_grounding_charades.json
│ │ └── Video-MME/
│ │   ├── README.md
│ │   └── videomme/
│ │     └── test-00000-of-00001.parquet
│ ├── lmms_eval/
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── api/
│ │ │ ├── __init__.py
│ │ │ ├── filter.py
│ │ │ ├── instance.py
│ │ │ ├── metrics.py
│ │ │ ├── model.py
│ │ │ ├── registry.py
│ │ │ ├── samplers.py
│ │ │ └── task.py
│ │ ├── evaluator.py
│ │ ├── filters/
│ │ │ ├── __init__.py
│ │ │ ├── decontamination.py
│ │ │ ├── extraction.py
│ │ │ ├── selection.py
│ │ │ └── transformation.py
│ │ ├── logging_utils.py
│ │ ├── models/
│ │ │ ├── __init__.py
│ │ │ └── videochat_flash.py
│ │ ├── tasks/
│ │ │ ├── __init__.py
│ │ │ ├── _task_utils/
│ │ │ │ ├── file_utils.py
│ │ │ │ ├── gpt_eval_utils.py
│ │ │ │ ├── video_loader.py
│ │ │ │ └── vqa_eval_metric.py
│ │ │ ├── longvideobench/
│ │ │ │ ├── longvideobench_test_v.yaml
│ │ │ │ ├── longvideobench_val_i.yaml
│ │ │ │ ├── longvideobench_val_v.yaml
│ │ │ │ └── utils.py
│ │ │ ├── lvbench/
│ │ │ │ ├── _default_template.yaml
│ │ │ │ ├── lvbench.yaml
│ │ │ │ ├── lvbench_cartoon.yaml
│ │ │ │ ├── lvbench_documentary.yaml
│ │ │ │ ├── lvbench_live.yaml
│ │ │ │ ├── lvbench_selfmedia.yaml
│ │ │ │ ├── lvbench_sport.yaml
│ │ │ │ ├── lvbench_tv.yaml
│ │ │ │ └── utils.py
│ │ │ ├── mlvu_mc/
│ │ │ │ ├── _default_template.yaml
│ │ │ │ ├── mlvu_mc.yaml
│ │ │ │ ├── mlvu_mc_anomaly_reco.yaml
│ │ │ │ ├── mlvu_mc_count.yaml
│ │ │ │ ├── mlvu_mc_ego.yaml
│ │ │ │ ├── mlvu_mc_needle.yaml
│ │ │ │ ├── mlvu_mc_order.yaml
│ │ │ │ ├── mlvu_mc_plotqa.yaml
│ │ │ │ ├── mlvu_mc_topic_reasoning.yaml
│ │ │ │ └── utils.py
│ │ │ ├── mvbench/
│ │ │ │ ├── _default_template.yaml
│ │ │ │ ├── mvbench.yaml
│ │ │ │ ├── mvbench_action_antonym.yaml
│ │ │ │ ├── mvbench_action_count.yaml
│ │ │ │ ├── mvbench_action_localization.yaml
│ │ │ │ ├── mvbench_action_prediction.yaml
│ │ │ │ ├── mvbench_action_sequence.yaml
│ │ │ │ ├── mvbench_character_order.yaml
│ │ │ │ ├── mvbench_counterfactual_inference.yaml
│ │ │ │ ├── mvbench_egocentric_navigation.yaml
│ │ │ │ ├── mvbench_episodic_reasoning.yaml
│ │ │ │ ├── mvbench_fine_grained_action.yaml
│ │ │ │ ├── mvbench_fine_grained_pose.yaml
│ │ │ │ ├── mvbench_moving_attribute.yaml
│ │ │ │ ├── mvbench_moving_count.yaml
│ │ │ │ ├── mvbench_moving_direction.yaml
│ │ │ │ ├── mvbench_object_existence.yaml
│ │ │ │ ├── mvbench_object_interaction.yaml
│ │ │ │ ├── mvbench_object_shuffle.yaml
│ │ │ │ ├── mvbench_scene_transition.yaml
│ │ │ │ ├── mvbench_state_change.yaml
│ │ │ │ ├── mvbench_unexpected_action.yaml
│ │ │ │ └── utils.py
│ │ │ ├── perceptiontest/
│ │ │ │ └── val/
│ │ │ │   ├── _default_template_yaml
│ │ │ │   ├── perceptiontest_mc.yaml
│ │ │ │   └── utils.py
│ │ │ ├── temporal_grounding/
│ │ │ │ ├── _default_template.yaml
│ │ │ │ ├── charades.yaml
│ │ │ │ ├── eval_tvg.py
│ │ │ │ └── utils.py
│ │ │ └── videomme/
│ │ │   ├── utils.py
│ │ │   ├── videomme.yaml
│ │ │   └── videomme_w_subtitle.yaml
│ │ └── utils.py
│ ├── pyproject.toml
│ ├── scripts/
│ │ ├── eval_longvideobench.sh
│ │ ├── eval_lvbench.sh
│ │ ├── eval_mlvu.sh
│ │ ├── eval_mvbench.sh
│ │ ├── eval_perceptiontest_val_mc.sh
│ │ ├── eval_temporal_grounding_chardes.sh
│ │ └── eval_videomme.sh
│ ├── setup.py
│ └── videochat-flash-7B@448_eval_log_videomme.json
├── xtuner-eval_niah/
│ ├── README.md
│ ├── llava/
│ │ ├── __init__.py
│ │ ├── constants.py
│ │ ├── conversation.py
│ │ ├── dist_utils.py
│ │ ├── mm_utils.py
│ │ ├── model/
│ │ │ ├── __init__.py
│ │ │ ├── apply_delta.py
│ │ │ ├── builder.py
│ │ │ ├── consolidate.py
│ │ │ ├── language_model/
│ │ │ │ ├── llava_qwen.py
│ │ │ │ ├── llava_qwen_flash.py
│ │ │ │ └── modeling_qwen2_flash.py
│ │ │ ├── llava_arch.py
│ │ │ ├── make_delta.py
│ │ │ ├── multimodal_encoder/
│ │ │ │ ├── builder.py
│ │ │ │ ├── clip_encoder.py
│ │ │ │ ├── internvideo2/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── flash_attention_class.py
│ │ │ │ │ ├── pos_embed.py
│ │ │ │ │ └── vit_scale_clean.py
│ │ │ │ ├── internvideo2_encoder.py
│ │ │ │ ├── siglip_encoder.py
│ │ │ │ ├── umt/
│ │ │ │ │ └── vit.py
│ │ │ │ └── umt_encoder.py
│ │ │ ├── multimodal_projector/
│ │ │ │ ├── builder.py
│ │ │ │ └── tome16_mlp_hd64.py
│ │ │ └── utils.py
│ │ ├── serialize_utils.py
│ │ ├── train/
│ │ │ ├── llava_trainer.py
│ │ │ ├── llava_trainer_eval.py
│ │ │ ├── train.py
│ │ │ └── train_mem.py
│ │ ├── utils.py
│ │ └── video_utils.py
│ ├── longva/
│ │ ├── __init__.py
│ │ ├── constants.py
│ │ ├── conversation.py
│ │ ├── mm_utils.py
│ │ ├── model/
│ │ │ ├── __init__.py
│ │ │ ├── apply_delta.py
│ │ │ ├── builder.py
│ │ │ ├── consolidate.py
│ │ │ ├── language_model/
│ │ │ │ ├── llava_llama.py
│ │ │ │ ├── llava_mistral.py
│ │ │ │ ├── llava_mpt.py
│ │ │ │ ├── llava_qwen.py
│ │ │ │ └── modeling_llama.py
│ │ │ ├── llava_arch.py
│ │ │ ├── make_delta.py
│ │ │ ├── multimodal_encoder/
│ │ │ │ ├── builder.py
│ │ │ │ └── clip_encoder.py
│ │ │ ├── multimodal_projector/
│ │ │ │ ├── builder.py
│ │ │ │ └── pooler_projector.py
│ │ │ ├── multimodal_resampler/
│ │ │ │ ├── builder.py
│ │ │ │ ├── masked_drop.py
│ │ │ │ ├── perceiver.py
│ │ │ │ ├── qformer.py
│ │ │ │ └── spatial_pool.py
│ │ │ └── utils.py
│ │ ├── train/
│ │ │ ├── llama_flash_attn_monkey_patch.py
│ │ │ ├── llava_trainer.py
│ │ │ ├── train.py
│ │ │ ├── train_dpo.py
│ │ │ └── train_mem.py
│ │ └── utils.py
│ ├── niah_requirements.txt
│ ├── tmp/
│ │ └── git_placeholder
│ ├── vision_niah/
│ │ ├── data/
│ │ │ ├── haystack_embeddings/
│ │ │ │ └── git_placeholder
│ │ │ ├── haystack_videos/
│ │ │ │ └── git_placeholder
│ │ │ ├── needle_embeddings/
│ │ │ │ └── git_placeholder
│ │ │ └── source_data/
│ │ │   ├── git_placeholder
│ │ │   └── niah-coco-singlehop_20.json
│ │ ├── data_multi/
│ │ │ ├── needle_embeddings/
│ │ │ │ └── git_placeholder
│ │ │ └── source_data/
│ │ │   ├── git_placeholder
│ │ │   └── niah-coco-multihop-100.json
│ │ ├── flash_eval_xtuner_multi.sh
│ │ ├── flash_eval_xtuner_single.sh
│ │ ├── log/
│ │ │ ├── s1/
│ │ │ │ └── git_placeholder
│ │ │ ├── s2/
│ │ │ │ └── git_placeholder
│ │ │ └── s3/
│ │ │   └── git_placeholder
│ │ ├── longva_eval_xtuner_multi.sh
│ │ ├── longva_eval_xtuner_single.sh
│ │ ├── model_weights/
│ │ │ └── git_placeholder
│ │ ├── multi_eval_vision_niah.py
│ │ ├── multi_produce_needle_embedding.py
│ │ ├── niah_output_multi/
│ │ │ └── git_placeholder
│ │ ├── niah_output_single/
│ │ │ └── git_placeholder
│ │ ├── produce_haystack_embedding.py
│ │ ├── single_eval_vision_niah.py
│ │ └── single_produce_needle_embedding.py
│ └── xtuner/
│   ├── __init__.py
│   ├── _lite/
│   │ ├── __init__.py
│   │ ├── accelerate/
│   │ │ ├── __init__.py
│   │ │ ├── dispatches/
│   │ │ │ ├── __init__.py
│   │ │ │ ├── _attention.py
│   │ │ │ ├── _fused/
│   │ │ │ │ ├── __init__.py
│   │ │ │ │ ├── layer_norm.py
│   │ │ │ │ ├── rms_norm.py
│   │ │ │ │ └── rotary.py
│   │ │ │ ├── clip.py
│   │ │ │ ├── internlm2.py
│   │ │ │ ├── llama.py
│   │ │ │ └── qwen2.py
│   │ │ ├── generate.py
│   │ │ ├── lora.py
│   │ │ └── packed.py
│   │ ├── auto.py
│   │ ├── chat/
│   │ │ ├── __init__.py
│   │ │ ├── backends/
│   │ │ │ └── __init__.py
│   │ │ ├── messages/
│   │ │ │ ├── __init__.py
│   │ │ │ ├── base.py
│   │ │ │ └── chat.py
│   │ │ └── templates/
│   │ │   ├── __init__.py
│   │ │   ├── chat.py
│   │ │   └── hybrid.py
│   │ ├── datasets/
│   │ │ ├── __init__.py
│   │ │ ├── cache.py
│   │ │ ├── format.py
│   │ │ ├── llava.py
│   │ │ ├── load.py
│   │ │ ├── pretrain.py
│   │ │ ├── text.py
│   │ │ └── tokenize.py
│   │ ├── modelings/
│   │ │ ├── __init__.py
│   │ │ ├── internlm2/
│   │ │ │ ├── __init__.py
│   │ │ │ ├── configuration_internlm2.py
│   │ │ │ └── modeling_internlm2.py
│   │ │ └── llava/
│   │ │   ├── __init__.py
│   │ │   ├── configuration_internlm2.py
│   │ │   ├── configuration_llava.py
│   │ │   ├── modeling_internlm2.py
│   │ │   ├── modeling_llava.py
│   │ │   └── processing_llava.py
│   │ ├── parallel/
│   │ │ ├── __init__.py
│   │ │ ├── comm.py
│   │ │ ├── fsdp/
│   │ │ │ ├── __init__.py
│   │ │ │ ├── checkpointing.py
│   │ │ │ ├── lazy.py
│   │ │ │ ├── precision.py
│   │ │ │ └── wrap.py
│   │ │ ├── logger.py
│   │ │ ├── plans/
│   │ │ │ └── internlm2.py
│   │ │ ├── sampler.py
│   │ │ ├── sequence/
│   │ │ │ ├── __init__.py
│   │ │ │ ├── attention.py
│   │ │ │ ├── data_collate.py
│   │ │ │ ├── ops.py
│   │ │ │ └── reduce_loss.py
│   │ │ └── setup.py
│   │ └── yunchang/
│   │   ├── __init__.py
│   │   ├── comm/
│   │   │ ├── __init__.py
│   │   │ ├── all_to_all.py
│   │   │ └── extract_local.py
│   │   ├── globals.py
│   │   ├── hybrid/
│   │   │ ├── __init__.py
│   │   │ ├── async_attn_layer.py
│   │   │ ├── attn_layer.py
│   │   │ └── utils.py
│   │   ├── ring/
│   │   │ ├── __init__.py
│   │   │ ├── llama3_flash_attn_varlen.py
│   │   │ ├── ring_flash_attn.py
│   │   │ ├── ring_flash_attn_varlen.py
│   │   │ ├── stripe_flash_attn.py
│   │   │ ├── triton_utils.py
│   │   │ ├── utils.py
│   │   │ ├── zigzag_ring_flash_attn.py
│   │   │ └── zigzag_ring_flash_attn_varlen.py
│   │   └── ulysses/
│   │     ├── __init__.py
│   │     └── attn_layer.py
│   ├── apis/
│   │ ├── __init__.py
│   │ ├── datasets/
│   │ │ ├── __init__.py
│   │ │ ├── alpaca.py
│   │ │ ├── arxiv.py
│   │ │ ├── code_alpaca.py
│   │ │ ├── colorist.py
│   │ │ ├── lawyer.py
│   │ │ ├── medical.py
│   │ │ ├── moss_003_sft.py
│   │ │ ├── oasst1.py
│   │ │ ├── open_orca.py
│   │ │ ├── sql.py
│   │ │ ├── tiny_codes.py
│   │ │ └── wizardlm.py
│   │ ├── model.py
│   │ └── training_args.py
│   ├── configs/
│   │ ├── __init__.py
│   │ ├── baichuan/
│   │ │ ├── baichuan2_13b_base/
│   │ │ │ ├── baichuan2_13b_base_qlora_alpaca_e3.py
│   │ │ │ ├── baichuan2_13b_base_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── baichuan2_13b_base_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── baichuan2_13b_base_qlora_alpaca_zh_e3.py
│   │ │ │ ├── baichuan2_13b_base_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── baichuan2_13b_base_qlora_code_alpaca_e3.py
│   │ │ │ ├── baichuan2_13b_base_qlora_colorist_e5.py
│   │ │ │ ├── baichuan2_13b_base_qlora_lawyer_e3.py
│   │ │ │ ├── baichuan2_13b_base_qlora_oasst1_512_e3.py
│   │ │ │ ├── baichuan2_13b_base_qlora_oasst1_e3.py
│   │ │ │ ├── baichuan2_13b_base_qlora_open_platypus_e3.py
│   │ │ │ └── baichuan2_13b_base_qlora_sql_e3.py
│   │ │ ├── baichuan2_13b_chat/
│   │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_e3.py
│   │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_zh_e3.py
│   │ │ │ ├── baichuan2_13b_chat_qlora_code_alpaca_e3.py
│   │ │ │ ├── baichuan2_13b_chat_qlora_lawyer_e3.py
│   │ │ │ ├── baichuan2_13b_chat_qlora_oasst1_512_e3.py
│   │ │ │ ├── baichuan2_13b_chat_qlora_oasst1_e3.py
│   │ │ │ └── baichuan2_13b_chat_qlora_open_platypus_e3.py
│   │ │ ├── baichuan2_7b_base/
│   │ │ │ ├── baichuan2_7b_base_qlora_alpaca_e3.py
│   │ │ │ ├── baichuan2_7b_base_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── baichuan2_7b_base_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── baichuan2_7b_base_qlora_alpaca_zh_e3.py
│   │ │ │ ├── baichuan2_7b_base_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── baichuan2_7b_base_qlora_code_alpaca_e3.py
│   │ │ │ ├── baichuan2_7b_base_qlora_colorist_e5.py
│   │ │ │ ├── baichuan2_7b_base_qlora_lawyer_e3.py
│   │ │ │ ├── baichuan2_7b_base_qlora_oasst1_512_e3.py
│   │ │ │ ├── baichuan2_7b_base_qlora_oasst1_e3.py
│   │ │ │ ├── baichuan2_7b_base_qlora_open_platypus_e3.py
│   │ │ │ └── baichuan2_7b_base_qlora_sql_e3.py
│   │ │ ├── baichuan2_7b_chat/
│   │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_e3.py
│   │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_zh_e3.py
│   │ │ │ ├── baichuan2_7b_chat_qlora_code_alpaca_e3.py
│   │ │ │ ├── baichuan2_7b_chat_qlora_lawyer_e3.py
│   │ │ │ ├── baichuan2_7b_chat_qlora_oasst1_512_e3.py
│   │ │ │ ├── baichuan2_7b_chat_qlora_oasst1_e3.py
│   │ │ │ └── baichuan2_7b_chat_qlora_open_platypus_e3.py
│   │ │ ├── baichuan_13b_base/
│   │ │ │ ├── baichuan_13b_base_qlora_alpaca_e3.py
│   │ │ │ ├── baichuan_13b_base_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── baichuan_13b_base_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── baichuan_13b_base_qlora_alpaca_zh_e3.py
│   │ │ │ ├── baichuan_13b_base_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── baichuan_13b_base_qlora_code_alpaca_e3.py
│   │ │ │ ├── baichuan_13b_base_qlora_colorist_e5.py
│   │ │ │ ├── baichuan_13b_base_qlora_lawyer_e3.py
│   │ │ │ ├── baichuan_13b_base_qlora_medical_e1.py
│   │ │ │ ├── baichuan_13b_base_qlora_moss_sft_all_e1.py
│   │ │ │ ├── baichuan_13b_base_qlora_moss_sft_all_e2_gpu8.py
│   │ │ │ ├── baichuan_13b_base_qlora_moss_sft_plugins_e1.py
│   │ │ │ ├── baichuan_13b_base_qlora_oasst1_512_e3.py
│   │ │ │ ├── baichuan_13b_base_qlora_oasst1_e3.py
│   │ │ │ ├── baichuan_13b_base_qlora_open_platypus_e3.py
│   │ │ │ ├── baichuan_13b_base_qlora_openorca_e1.py
│   │ │ │ ├── baichuan_13b_base_qlora_sql_e3.py
│   │ │ │ └── baichuan_13b_base_qlora_tiny_codes_e1.py
│   │ │ ├── baichuan_13b_chat/
│   │ │ │ ├── baichuan_13b_chat_qlora_alpaca_e3.py
│   │ │ │ ├── baichuan_13b_chat_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── baichuan_13b_chat_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── baichuan_13b_chat_qlora_alpaca_zh_e3.py
│   │ │ │ ├── baichuan_13b_chat_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── baichuan_13b_chat_qlora_code_alpaca_e3.py
│   │ │ │ ├── baichuan_13b_chat_qlora_colorist_e5.py
│   │ │ │ ├── baichuan_13b_chat_qlora_lawyer_e3.py
│   │ │ │ ├── baichuan_13b_chat_qlora_medical_e1.py
│   │ │ │ ├── baichuan_13b_chat_qlora_oasst1_512_e3.py
│   │ │ │ ├── baichuan_13b_chat_qlora_oasst1_e3.py
│   │ │ │ ├── baichuan_13b_chat_qlora_open_platypus_e3.py
│   │ │ │ ├── baichuan_13b_chat_qlora_openorca_e1.py
│   │ │ │ ├── baichuan_13b_chat_qlora_sql_e3.py
│   │ │ │ └── baichuan_13b_chat_qlora_tiny_codes_e1.py
│   │ │ └── baichuan_7b/
│   │ │   ├── baichuan_7b_qlora_alpaca_e3.py
│   │ │   ├── baichuan_7b_qlora_alpaca_enzh_e3.py
│   │ │   ├── baichuan_7b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │   ├── baichuan_7b_qlora_alpaca_zh_e3.py
│   │ │   ├── baichuan_7b_qlora_arxiv_gentitle_e3.py
│   │ │   ├── baichuan_7b_qlora_code_alpaca_e3.py
│   │ │   ├── baichuan_7b_qlora_colorist_e5.py
│   │ │   ├── baichuan_7b_qlora_lawyer_e3.py
│   │ │   ├── baichuan_7b_qlora_medical_e1.py
│   │ │   ├── baichuan_7b_qlora_moss_sft_all_e1.py
│   │ │   ├── baichuan_7b_qlora_moss_sft_all_e2_gpu8.py
│   │ │   ├── baichuan_7b_qlora_moss_sft_plugins_e1.py
│   │ │   ├── baichuan_7b_qlora_oasst1_512_e3.py
│   │ │   ├── baichuan_7b_qlora_oasst1_e3.py
│   │ │   ├── baichuan_7b_qlora_open_platypus_e3.py
│   │ │   ├── baichuan_7b_qlora_openorca_e1.py
│   │ │   ├── baichuan_7b_qlora_sql_e3.py
│   │ │   └── baichuan_7b_qlora_tiny_codes_e1.py
│   │ ├── chatglm/
│   │ │ ├── chatglm2_6b/
│   │ │ │ ├── chatglm2_6b_qlora_alpaca_e3.py
│   │ │ │ ├── chatglm2_6b_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── chatglm2_6b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── chatglm2_6b_qlora_alpaca_zh_e3.py
│   │ │ │ ├── chatglm2_6b_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── chatglm2_6b_qlora_code_alpaca_e3.py
│   │ │ │ ├── chatglm2_6b_qlora_colorist_e5.py
│   │ │ │ ├── chatglm2_6b_qlora_lawyer_e3.py
│   │ │ │ ├── chatglm2_6b_qlora_medical_e1.py
│   │ │ │ ├── chatglm2_6b_qlora_oasst1_512_e3.py
│   │ │ │ ├── chatglm2_6b_qlora_oasst1_e3.py
│   │ │ │ ├── chatglm2_6b_qlora_open_platypus_e3.py
│   │ │ │ ├── chatglm2_6b_qlora_openorca_e1.py
│   │ │ │ ├── chatglm2_6b_qlora_sql_e3.py
│   │ │ │ └── chatglm2_6b_qlora_tiny_codes_e1.py
│   │ │ ├── chatglm3_6b/
│   │ │ │ ├── chatglm3_6b_qlora_alpaca_e3.py
│   │ │ │ ├── chatglm3_6b_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── chatglm3_6b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── chatglm3_6b_qlora_alpaca_zh_e3.py
│   │ │ │ ├── chatglm3_6b_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── chatglm3_6b_qlora_code_alpaca_e3.py
│   │ │ │ ├── chatglm3_6b_qlora_colorist_e5.py
│   │ │ │ ├── chatglm3_6b_qlora_lawyer_e3.py
│   │ │ │ ├── chatglm3_6b_qlora_medical_e1.py
│   │ │ │ ├── chatglm3_6b_qlora_oasst1_512_e3.py
│   │ │ │ ├── chatglm3_6b_qlora_oasst1_e3.py
│   │ │ │ ├── chatglm3_6b_qlora_open_platypus_e3.py
│   │ │ │ ├── chatglm3_6b_qlora_openorca_e1.py
│   │ │ │ ├── chatglm3_6b_qlora_sql_e3.py
│   │ │ │ └── chatglm3_6b_qlora_tiny_codes_e1.py
│   │ │ └── chatglm3_6b_base/
│   │ │   ├── chatglm3_6b_base_qlora_alpaca_e3.py
│   │ │   ├── chatglm3_6b_base_qlora_alpaca_enzh_e3.py
│   │ │   ├── chatglm3_6b_base_qlora_alpaca_enzh_oasst1_e3.py
│   │ │   ├── chatglm3_6b_base_qlora_alpaca_zh_e3.py
│   │ │   ├── chatglm3_6b_base_qlora_arxiv_gentitle_e3.py
│   │ │   ├── chatglm3_6b_base_qlora_code_alpaca_e3.py
│   │ │   ├── chatglm3_6b_base_qlora_colorist_e5.py
│   │ │   ├── chatglm3_6b_base_qlora_lawyer_e3.py
│   │ │   ├── chatglm3_6b_base_qlora_medical_e1.py
│   │ │   ├── chatglm3_6b_base_qlora_oasst1_512_e3.py
│   │ │   ├── chatglm3_6b_base_qlora_oasst1_e3.py
│   │ │   ├── chatglm3_6b_base_qlora_open_platypus_e3.py
│   │ │   ├── chatglm3_6b_base_qlora_openorca_e1.py
│   │ │   ├── chatglm3_6b_base_qlora_sql_e3.py
│   │ │   └── chatglm3_6b_base_qlora_tiny_codes_e1.py
│   │ ├── cohere/
│   │ │ ├── README.md
│   │ │ └── cohere_104b/
│   │ │   └── cohere_100b_128k_sp32.py
│   │ ├── custom_dataset/
│   │ │ ├── pretrain/
│   │ │ │ ├── baichuan/
│   │ │ │ │ ├── baichuan2_13b_base_full_custom_pretrain_e1.py
│   │ │ │ │ └── baichuan2_7b_base_full_custom_pretrain_e1.py
│   │ │ │ ├── chatglm/
│   │ │ │ │ ├── chatglm2_6b_full_custom_pretrain_e1.py
│   │ │ │ │ └── chatglm3_6b_full_custom_pretrain_e1.py
│   │ │ │ ├── deepseek/
│   │ │ │ │ └── deepseek_moe_16b_base_full_custom_pretrain_e1.py
│   │ │ │ ├── gemma/
│   │ │ │ │ ├── gemma_2b_full_custom_pretrain_e1.py
│   │ │ │ │ └── gemma_7b_full_custom_pretrain_e1.py
│   │ │ │ ├── internlm/
│   │ │ │ │ ├── internlm2_1_8b_full_custom_pretrain_e1.py
│   │ │ │ │ ├── internlm2_20b_full_custom_pretrain_e1.py
│   │ │ │ │ └── internlm2_7b_full_custom_pretrain_e1.py
│   │ │ │ ├── llama/
│   │ │ │ │ ├── llama2_70b_full_custom_pretrain_e1.py
│   │ │ │ │ └── llama2_7b_full_custom_pretrain_e1.py
│   │ │ │ ├── mistral/
│   │ │ │ │ └── mistral_7b_full_custom_pretrain_e1.py
│   │ │ │ ├── mixtral/
│   │ │ │ │ └── mixtral_8x7b_full_custom_pretrain_e1.py
│   │ │ │ ├── qwen/
│   │ │ │ │ ├── qwen1_5_0_5b_full_custom_pretrain_e1.py
│   │ │ │ │ ├── qwen1_5_14b_full_custom_pretrain_e1.py
│   │ │ │ │ ├── qwen1_5_1_8b_full_custom_pretrain_e1.py
│   │ │ │ │ ├── qwen1_5_4b_full_custom_pretrain_e1.py
│   │ │ │ │ ├── qwen1_5_72b_full_custom_pretrain_e1.py
│   │ │ │ │ ├── qwen1_5_7b_full_custom_pretrain_e1.py
│   │ │ │ │ ├── qwen_1_8b_full_custom_pretrain_e1.py
│   │ │ │ │ ├── qwen_72b_full_custom_pretrain_e1.py
│   │ │ │ │ └── qwen_7b_full_custom_pretrain_e1.py
│   │ │ │ ├── starcoder/
│   │ │ │ │ └── starcoder_full_custom_pretrain_e1.py
│   │ │ │ ├── yi/
│   │ │ │ │ ├── yi_34b_full_custom_pretrain_e1.py
│   │ │ │ │ └── yi_6b_full_custom_pretrain_e1.py
│   │ │ │ └── zephyr/
│   │ │ │   └── zephyr_7b_beta_full_custom_pretrain_e1.py
│   │ │ └── sft/
│   │ │   ├── baichuan/
│   │ │   │ ├── baichuan2_13b_chat_qlora_custom_sft_e1.py
│   │ │   │ ├── baichuan2_7b_chat_qlora_custom_sft_e1.py
│   │ │   │ ├── baichuan_13b_chat_qlora_custom_sft_e1.py
│   │ │   │ └── baichuan_7b_qlora_custom_sft_e1.py
│   │ │   ├── chatglm/
│   │ │   │ ├── chatglm2_6b_qlora_custom_sft_e1.py
│   │ │   │ └── chatglm3_6b_qlora_custom_sft_e1.py
│   │ │   ├── deepseek/
│   │ │   │ ├── deepseek_moe_16b_chat_qlora_custom_sft_e1.py
│   │ │   │ └── deepseekcoder_6_7b_instruct_qlora_custom_sft_e1.py
│   │ │   ├── gemma/
│   │ │   │ ├── gemma_2b_it_qlora_custom_sft_e1.py
│   │ │   │ ├── gemma_2b_qlora_custom_sft_e1.py
│   │ │   │ ├── gemma_7b_it_qlora_custom_sft_e1.py
│   │ │   │ └── gemma_7b_qlora_custom_sft_e1.py
│   │ │   ├── internlm/
│   │ │   │ ├── internlm2_chat_1_8b_qlora_custom_sft_e1.py
│   │ │   │ ├── internlm2_chat_20b_qlora_custom_sft_e1.py
│   │ │   │ └── internlm2_chat_7b_qlora_custom_sft_e1.py
│   │ │   ├── llama/
│   │ │   │ ├── llama2_70b_qlora_custom_sft_e1.py
│   │ │   │ └── llama2_7b_chat_qlora_custom_sft_e1.py
│   │ │   ├── mistral/
│   │ │   │ └── mistral_7b_full_finetune_custom_sft_e1.py
│   │ │   ├── mixtral/
│   │ │   │ └── mixtral_8x7b_instruct_qlora_custom_sft_e1.py
│   │ │   ├── qwen/
│   │ │   │ ├── qwen1_5_0_5b_chat_qlora_custom_sft_e1.py
│   │ │   │ ├── qwen1_5_14b_chat_qlora_custom_sft_e1.py
│   │ │   │ ├── qwen1_5_1_8b_chat_qlora_custom_sft_e1.py
│   │ │   │ ├── qwen1_5_4b_chat_qlora_custom_sft_e1.py
│   │ │   │ ├── qwen1_5_72b_chat_qlora_custom_sft_e1.py
│   │ │   │ ├── qwen1_5_7b_chat_qlora_custom_sft_e1.py
│   │ │   │ ├── qwen_1_8b_chat_qlora_custom_sft_e1.py
│   │ │   │ ├── qwen_72b_qlora_custom_sft_e1.py
│   │ │   │ └── qwen_7b_chat_qlora_custom_sft_e1.py
│   │ │   ├── starcoder/
│   │ │   │ └── starcoder_qlora_custom_sft_e1.py
│   │ │   ├── yi/
│   │ │   │ ├── yi_34b_qlora_custom_sft_e1.py
│   │ │   │ └── yi_6b_qlora_custom_sft_e1.py
│   │ │   └── zephyr/
│   │ │     └── zephyr_7b_beta_qlora_custom_sft_e1.py
│   │ ├── deepseek/
│   │ │ ├── README.md
│   │ │ ├── deepseek_coder_6_7b_base/
│   │ │ │ └── deepseek_coder_6_7b_base_qlora_code_alpaca_e3.py
│   │ │ ├── deepseek_coder_6_7b_instruct/
│   │ │ │ └── deepseekcoder_6_7b_instruct_qlora_code_alpaca_e3.py
│   │ │ ├── deepseek_moe_16b_base/
│   │ │ │ ├── deepseek_moe_16b_base_full_oasst1_e3.py
│   │ │ │ └── deepseek_moe_16b_base_qlora_oasst1_e3.py
│   │ │ ├── deepseek_moe_16b_chat/
│   │ │ │ ├── deepseek_moe_16b_chat_full_oasst1_e3.py
│   │ │ │ └── deepseek_moe_16b_chat_qlora_oasst1_e3.py
│   │ │ ├── deepseek_v2_chat/
│   │ │ │ └── deepseek_v2_chat_full_alpaca_e3.py
│   │ │ └── deepseek_v2_lite_chat/
│   │ │   ├── deepseek_v2_lite_chat_full_alpaca_e3.py
│   │ │   └── deepseek_v2_lite_chat_full_alpaca_e3_32k_varlen.py
│   │ ├── deepspeed/
│   │ │ ├── deepspeed_zero1.json
│   │ │ ├── deepspeed_zero2.json
│   │ │ ├── deepspeed_zero2_offload.json
│   │ │ ├── deepspeed_zero3.json
│   │ │ └── deepspeed_zero3_offload.json
│   │ ├── dpo/
│   │ │ ├── internlm/
│   │ │ │ ├── internlm2_chat_1_8b_dpo_full.py
│   │ │ │ ├── internlm2_chat_1_8b_dpo_full_varlenattn.py
│   │ │ │ ├── internlm2_chat_1_8b_dpo_full_varlenattn_jsonl_dataset.py
│   │ │ │ └── internlm2_chat_7b_dpo_qlora_varlenattn.py
│   │ │ └── llama/
│   │ │   └── llama3_8b_instruct_dpo_qlora_varlenattn.py
│   │ ├── gemma/
│   │ │ ├── gemma_2b/
│   │ │ │ ├── gemma_2b_full_alpaca_e3.py
│   │ │ │ └── gemma_2b_qlora_alpaca_e3.py
│   │ │ ├── gemma_2b_it/
│   │ │ │ ├── gemma_2b_it_full_alpaca_e3.py
│   │ │ │ └── gemma_2b_it_qlora_alpaca_e3.py
│   │ │ ├── gemma_7b/
│   │ │ │ ├── gemma_7b_full_alpaca_e3.py
│   │ │ │ └── gemma_7b_qlora_alpaca_e3.py
│   │ │ └── gemma_7b_it/
│   │ │   ├── gemma_7b_it_full_alpaca_e3.py
│   │ │   └── gemma_7b_it_qlora_alpaca_e3.py
│   │ ├── internlm/
│   │ │ ├── internlm2_1_8b/
│   │ │ │ ├── internlm2_1_8b_full_alpaca_e3.py
│   │ │ │ └── internlm2_1_8b_qlora_alpaca_e3.py
│   │ │ ├── internlm2_20b/
│   │ │ │ ├── internlm2_20b_full_finetune_custom_dataset_e1.py
│   │ │ │ ├── internlm2_20b_qlora_alpaca_e3.py
│   │ │ │ ├── internlm2_20b_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── internlm2_20b_qlora_code_alpaca_e3.py
│   │ │ │ ├── internlm2_20b_qlora_colorist_e5.py
│   │ │ │ ├── internlm2_20b_qlora_lawyer_e3.py
│   │ │ │ ├── internlm2_20b_qlora_msagent_react_e3_gpu8.py
│   │ │ │ ├── internlm2_20b_qlora_oasst1_512_e3.py
│   │ │ │ ├── internlm2_20b_qlora_oasst1_e3.py
│   │ │ │ └── internlm2_20b_qlora_sql_e3.py
│   │ │ ├── internlm2_7b/
│   │ │ │ ├── internlm2_7b_full_finetune_custom_dataset_e1.py
│   │ │ │ ├── internlm2_7b_full_finetune_custom_dataset_e1_sequence_parallel_4.py
│   │ │ │ ├── internlm2_7b_qlora_alpaca_e3.py
│   │ │ │ ├── internlm2_7b_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── internlm2_7b_qlora_code_alpaca_e3.py
│   │ │ │ ├── internlm2_7b_qlora_colorist_e5.py
│   │ │ │ ├── internlm2_7b_qlora_json_e3.py
│   │ │ │ ├── internlm2_7b_qlora_lawyer_e3.py
│   │ │ │ ├── internlm2_7b_qlora_msagent_react_e3_gpu8.py
│   │ │ │ ├── internlm2_7b_qlora_oasst1_512_e3.py
│   │ │ │ ├── internlm2_7b_qlora_oasst1_e3.py
│   │ │ │ ├── internlm2_7b_qlora_sql_e3.py
│   │ │ │ ├── internlm2_7b_w_internevo_dataset.py
│   │ │ │ ├── internlm2_7b_w_tokenized_dataset.py
│   │ │ │ └── internlm2_7b_w_untokenized_dataset.py
│   │ │ ├── internlm2_chat_1_8b/
│   │ │ │ ├── internlm2_chat_1_8b_full_alpaca_e3.py
│   │ │ │ └── internlm2_chat_1_8b_qlora_alpaca_e3.py
│   │ │ ├── internlm2_chat_20b/
│   │ │ │ ├── internlm2_chat_20b_full_finetune_custom_dataset_e1.py
│   │ │ │ ├── internlm2_chat_20b_qlora_alpaca_e3.py
│   │ │ │ ├── internlm2_chat_20b_qlora_code_alpaca_e3.py
│   │ │ │ ├── internlm2_chat_20b_qlora_lawyer_e3.py
│   │ │ │ ├── internlm2_chat_20b_qlora_oasst1_512_e3.py
│   │ │ │ └── internlm2_chat_20b_qlora_oasst1_e3.py
│   │ │ ├── internlm2_chat_7b/
│   │ │ │ ├── internlm2_chat_7b_full_finetune_custom_dataset_e1.py
│   │ │ │ ├── internlm2_chat_7b_qlora_alpaca_e3.py
│   │ │ │ ├── internlm2_chat_7b_qlora_code_alpaca_e3.py
│   │ │ │ ├── internlm2_chat_7b_qlora_lawyer_e3.py
│   │ │ │ ├── internlm2_chat_7b_qlora_oasst1_512_e3.py
│   │ │ │ └── internlm2_chat_7b_qlora_oasst1_e3.py
│   │ │ ├── internlm_20b/
│   │ │ │ ├── internlm_20b_qlora_alpaca_e3.py
│   │ │ │ ├── internlm_20b_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── internlm_20b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── internlm_20b_qlora_alpaca_zh_e3.py
│   │ │ │ ├── internlm_20b_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── internlm_20b_qlora_code_alpaca_e3.py
│   │ │ │ ├── internlm_20b_qlora_colorist_e5.py
│   │ │ │ ├── internlm_20b_qlora_lawyer_e3.py
│   │ │ │ ├── internlm_20b_qlora_msagent_react_e3_gpu8.py
│   │ │ │ ├── internlm_20b_qlora_oasst1_512_e3.py
│   │ │ │ ├── internlm_20b_qlora_oasst1_e3.py
│   │ │ │ ├── internlm_20b_qlora_open_platypus_e3.py
│   │ │ │ └── internlm_20b_qlora_sql_e3.py
│   │ │ ├── internlm_7b/
│   │ │ │ ├── internlm_7b_full_alpaca_e3.py
│   │ │ │ ├── internlm_7b_full_alpaca_enzh_e3.py
│   │ │ │ ├── internlm_7b_full_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── internlm_7b_full_alpaca_zh_e3.py
│   │ │ │ ├── internlm_7b_full_intern_repo_dataset_template.py
│   │ │ │ ├── internlm_7b_full_oasst1_e3.py
│   │ │ │ ├── internlm_7b_qlora_alpaca_e3.py
│   │ │ │ ├── internlm_7b_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── internlm_7b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── internlm_7b_qlora_alpaca_zh_e3.py
│   │ │ │ ├── internlm_7b_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── internlm_7b_qlora_code_alpaca_e3.py
│   │ │ │ ├── internlm_7b_qlora_colorist_e5.py
│   │ │ │ ├── internlm_7b_qlora_json_e3.py
│   │ │ │ ├── internlm_7b_qlora_lawyer_e3.py
│   │ │ │ ├── internlm_7b_qlora_medical_e1.py
│   │ │ │ ├── internlm_7b_qlora_moss_sft_all_e1.py
│   │ │ │ ├── internlm_7b_qlora_moss_sft_all_e2_gpu8.py
│   │ │ │ ├── internlm_7b_qlora_moss_sft_plugins_e1.py
│   │ │ │ ├── internlm_7b_qlora_msagent_react_e3_gpu8.py
│   │ │ │ ├── internlm_7b_qlora_oasst1_512_e3.py
│   │ │ │ ├── internlm_7b_qlora_oasst1_e3.py
│   │ │ │ ├── internlm_7b_qlora_oasst1_e3_hf.py
│   │ │ │ ├── internlm_7b_qlora_oasst1_mmlu_e3.py
│   │ │ │ ├── internlm_7b_qlora_open_platypus_e3.py
│   │ │ │ ├── internlm_7b_qlora_openorca_e1.py
│   │ │ │ ├── internlm_7b_qlora_sql_e3.py
│   │ │ │ └── internlm_7b_qlora_tiny_codes_e1.py
│   │ │ ├── internlm_chat_20b/
│   │ │ │ ├── internlm_chat_20b_qlora_alpaca_e3.py
│   │ │ │ ├── internlm_chat_20b_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── internlm_chat_20b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── internlm_chat_20b_qlora_alpaca_zh_e3.py
│   │ │ │ ├── internlm_chat_20b_qlora_code_alpaca_e3.py
│   │ │ │ ├── internlm_chat_20b_qlora_lawyer_e3.py
│   │ │ │ ├── internlm_chat_20b_qlora_oasst1_512_e3.py
│   │ │ │ ├── internlm_chat_20b_qlora_oasst1_e3.py
│   │ │ │ └── internlm_chat_20b_qlora_open_platypus_e3.py
│   │ │ └── internlm_chat_7b/
│   │ │   ├── internlm_chat_7b_qlora_alpaca_e3.py
│   │ │   ├── internlm_chat_7b_qlora_alpaca_enzh_e3.py
│   │ │   ├── internlm_chat_7b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │   ├── internlm_chat_7b_qlora_alpaca_zh_e3.py
│   │ │   ├── internlm_chat_7b_qlora_arxiv_gentitle_e3.py
│   │ │   ├── internlm_chat_7b_qlora_code_alpaca_e3.py
│   │ │   ├── internlm_chat_7b_qlora_colorist_e5.py
│   │ │   ├── internlm_chat_7b_qlora_lawyer_e3.py
│   │ │   ├── internlm_chat_7b_qlora_medical_e1.py
│   │ │   ├── internlm_chat_7b_qlora_oasst1_512_e3.py
│   │ │   ├── internlm_chat_7b_qlora_oasst1_e3.py
│   │ │   ├── internlm_chat_7b_qlora_open_platypus_e3.py
│   │ │   ├── internlm_chat_7b_qlora_openorca_e1.py
│   │ │   ├── internlm_chat_7b_qlora_sql_e3.py
│   │ │   └── internlm_chat_7b_qlora_tiny_codes_e1.py
│   │ ├── llama/
│   │ │ ├── llama2_70b/
│   │ │ │ ├── llama2_70b_full_wizardlm_e1.py
│   │ │ │ ├── llama2_70b_int8_lora_open_platypus_e1.py
│   │ │ │ ├── llama2_70b_int8_lora_open_platypus_e1_hf.py
│   │ │ │ ├── llama2_70b_qlora_open_platypus_e1.py
│   │ │ │ └── llama2_70b_qlora_open_platypus_e1_hf.py
│   │ │ ├── llama2_7b/
│   │ │ │ ├── llama2_7b_full_pgbooks_400iters_sp1.py
│   │ │ │ ├── llama2_7b_full_pgbooks_400iters_sp4.py
│   │ │ │ ├── llama2_7b_full_wizardlm_e1.py
│   │ │ │ ├── llama2_7b_qlora_alpaca_e3.py
│   │ │ │ ├── llama2_7b_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── llama2_7b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── llama2_7b_qlora_alpaca_zh_e3.py
│   │ │ │ ├── llama2_7b_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── llama2_7b_qlora_code_alpaca_e3.py
│   │ │ │ ├── llama2_7b_qlora_colorist_e5.py
│   │ │ │ ├── llama2_7b_qlora_lawyer_e3.py
│   │ │ │ ├── llama2_7b_qlora_medical_e1.py
│   │ │ │ ├── llama2_7b_qlora_moss_sft_all_e1.py
│   │ │ │ ├── llama2_7b_qlora_moss_sft_all_e2_gpu8.py
│   │ │ │ ├── llama2_7b_qlora_moss_sft_plugins_e1.py
│   │ │ │ ├── llama2_7b_qlora_msagent_react_e3_gpu8.py
│   │ │ │ ├── llama2_7b_qlora_oasst1_512_e3.py
│   │ │ │ ├── llama2_7b_qlora_oasst1_e3.py
│   │ │ │ ├── llama2_7b_qlora_open_platypus_e3.py
│   │ │ │ ├── llama2_7b_qlora_openorca_e1.py
│   │ │ │ ├── llama2_7b_qlora_sql_e3.py
│   │ │ │ └── llama2_7b_qlora_tiny_codes_e1.py
│   │ │ ├── llama2_7b_chat/
│   │ │ │ ├── llama2_7b_chat_qlora_alpaca_e3.py
│   │ │ │ ├── llama2_7b_chat_qlora_alpaca_enzh_e3.py
│   │ │ │ ├── llama2_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ ├── llama2_7b_chat_qlora_alpaca_zh_e3.py
│   │ │ │ ├── llama2_7b_chat_qlora_arxiv_gentitle_e3.py
│   │ │ │ ├── llama2_7b_chat_qlora_code_alpaca_e3.py
│   │ │ │ ├── llama2_7b_chat_qlora_colorist_e5.py
│   │ │ │ ├── llama2_7b_chat_qlora_lawyer_e3.py
│   │ │ │ ├── llama2_7b_chat_qlora_medical_e1.py
│   │ │ │ ├── llama2_7b_chat_qlora_oasst1_512_e3.py
│   │ │ │ ├── llama2_7b_chat_qlora_oasst1_e3.py
│   │ │ │ ├── llama2_7b_chat_qlora_open_platypus_e3.py
│   │ │ │ ├── llama2_7b_chat_qlora_openorca_e1.py
│   │ │ │ ├── llama2_7b_chat_qlora_sql_e3.py
│   │ │ │ └── llama2_7b_chat_qlora_tiny_codes_e1.py
│   │ │ ├── llama3_70b_instruct/
│   │ │ │ └── llama3_70b_instruct_qlora_alpaca_e3_2k_gpu8.py
│   │ │ ├── llama3_8b/
│   │ │ │ ├── README.md
│   │ │ │ └── llama3_8b_full_alpaca_e3.py
│   │ │ ├── llama3_8b_instruct/
│   │ │ │ ├── llama3_8b_instruct_full_alpaca_e3.py
│   │ │ │ └── llama3_8b_instruct_qlora_alpaca_e3.py
│   │ │ └── llama_7b/
│   │ │   ├── llama_7b_qlora_alpaca_e3.py
│   │ │   ├── llama_7b_qlora_alpaca_enzh_e3.py
│   │ │   ├── llama_7b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │   ├── llama_7b_qlora_alpaca_zh_e3.py
│   │ │   ├── llama_7b_qlora_arxiv_gentitle_e3.py
│   │ │   ├── llama_7b_qlora_code_alpaca_e3.py
│   │ │   ├── llama_7b_qlora_colorist_e5.py
│   │ │   ├── llama_7b_qlora_lawyer_e3.py
│   │ │   ├── llama_7b_qlora_medical_e1.py
│   │ │   ├── llama_7b_qlora_moss_sft_all_e1.py
│   │ │   ├── llama_7b_qlora_moss_sft_all_e2_gpu8.py
│   │ │   ├── llama_7b_qlora_moss_sft_plugins_e1.py
│   │ │   ├── llama_7b_qlora_oasst1_512_e3.py
│   │ │   ├── llama_7b_qlora_oasst1_e3.py
│   │ │   ├── llama_7b_qlora_open_platypus_e3.py
│   │ │   ├── llama_7b_qlora_openorca_e1.py
│   │ │   ├── llama_7b_qlora_sql_e3.py
│   │ │   └── llama_7b_qlora_tiny_codes_e1.py
│   │ ├── llama_speed_benchmark/
│   │ │ ├── llama2_70b/
│   │ │ │ ├── llama2_70b_full_alpaca_enzh_128k_sp8.py
│   │ │ │ ├── llama2_70b_full_alpaca_enzh_256k_sp16.py
│   │ │ │ ├── llama2_70b_full_alpaca_enzh_32k_sp4.py
│   │ │ │ └── llama2_70b_full_alpaca_enzh_8k_sp1.py
│   │ │ ├── llama2_7b/
│   │ │ │ ├── llama2_7b_full_alpaca_enzh_128k_sp8.py
│   │ │ │ ├── llama2_7b_full_alpaca_enzh_1M_sp16.py
│   │ │ │ ├── llama2_7b_full_alpaca_enzh_256k_sp8.py
│   │ │ │ ├── llama2_7b_full_alpaca_enzh_32k_sp1.py
│   │ │ │ └── llama2_7b_full_alpaca_enzh_8k_sp1.py
│   │ │ └── yi_34b/
│   │ │   ├── yi_34b_200k_full_alpaca_enzh_128k_sp8.py
│   │ │   ├── yi_34b_200k_full_alpaca_enzh_256k_sp8.py
│   │ │   ├── yi_34b_200k_full_alpaca_enzh_32k_sp2.py
│   │ │   └── yi_34b_200k_full_alpaca_enzh_8k_sp1.py
│   │ ├── llava/
│   │ │ ├── README.md
│   │ │ ├── README_zh-CN.md
│   │ │ ├── internlm2_chat_1_8b_clip_vit_large_p14_336/
│   │ │ │ ├── finetune/
│   │ │ │ │ └── llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│   │ │ │ └── pretrain/
│   │ │ │   └── llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│   │ │ ├── internlm2_chat_20b_clip_vit_large_p14_336/
│   │ │ │ ├── finetune/
│   │ │ │ │ ├── llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_finetune.py
│   │ │ │ │ └── llava_internlm2_chat_20b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│   │ │ │ └── pretrain/
│   │ │ │   └── llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│   │ │ ├── internlm2_chat_7b_clip_vit_large_p14_336/
│   │ │ │ ├── finetune/
│   │ │ │ │ ├── llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_finetune.py
│   │ │ │ │ └── llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│   │ │ │ └── pretrain/
│   │ │ │   └── llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│   │ │ ├── internlm_chat_7b_clip_vit_large_p14_336/
│   │ │ │ ├── finetune/
│   │ │ │ │ └── llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│   │ │ │ └── pretrain/
│   │ │ │   └── llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│   │ │ ├── llama3_70b_instruct_clip_vit_large_p14_336/
│   │ │ │ └── pretrain/
│   │ │ │   └── llava_llama3_70b_instruct_quant_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│   │ │ ├── llama3_8b_instruct_clip_vit_large_p14_336/
│   │ │ │ ├── README.md
│   │ │ │ ├── convert_xtuner_weights_to_hf.py
│   │ │ │ ├── convert_xtuner_weights_to_llava.py
│   │ │ │ ├── finetune/
│   │ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_e1_gpu8_finetune.py
│   │ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│   │ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py
│   │ │ │ │ └── llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune.py
│   │ │ │ └── pretrain/
│   │ │ │   ├── llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│   │ │ │   ├── llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py
│   │ │ │   └── llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain.py
│   │ │ ├── official/
│   │ │ │ ├── llava_v15_13b/
│   │ │ │ │ ├── llava_v15_13b_finetune.py
│   │ │ │ │ ├── llava_v15_13b_finetune_lora.py
│   │ │ │ │ └── llava_v15_13b_pretrain.py
│   │ │ │ └── llava_v15_7b/
│   │ │ │   ├── llava_v15_7b_finetune.py
│   │ │ │   ├── llava_v15_7b_finetune_lora.py
│   │ │ │   └── llava_v15_7b_pretrain.py
│   │ │ ├── phi3_mini_4k_instruct_clip_vit_large_p14_336/
│   │ │ │ ├── README.md
│   │ │ │ ├── convert_phi_to_llama.py
│   │ │ │ ├── convert_xtuner_weights_to_hf.py
│   │ │ │ ├── convert_xtuner_weights_to_llava.py
│   │ │ │ ├── finetune/
│   │ │ │ │ ├── llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_e1_gpu8_finetune.py
│   │ │ │ │ └── llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_full_e2_gpu8_internvl_finetune.py
│   │ │ │ └── pretrain/
│   │ │ │   ├── llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│   │ │ │   └── llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py
│   │ │ ├── vicuna_13b_v15_clip_vit_large_p14_336/
│   │ │ │ ├── finetune/
│   │ │ │ │ └── llava_vicuna_13b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│   │ │ │ └── pretrain/
│   │ │ │   └── llava_vicuna_13b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│   │ │ └── vicuna_7b_v15_clip_vit_large_p14_336/
│   │ │   ├── finetune/
│   │ │   │ ├── llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│   │ │   │ └── llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_refcoco.py
│   │ │   └── pretrain/
│   │ │     └── llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│   │ ├── mistral/
│   │ │ ├── mistral_7b_full_finetune_custom_dataset_e1.py
│   │ │ ├── mistral_7b_qlora_skypile_pretrain_e1.py
│   │ │ ├── mistral_7b_w_tokenized_dataset.py
│   │ │ └── mistral_7b_w_untokenized_dataset.py
│   │ ├── mixtral/
│   │ │ ├── README.md
│   │ │ ├── mixtral_8x7b/
│   │ │ │ ├── mixtral_8x7b_full_oasst1_e3.py
│   │ │ │ └── mixtral_8x7b_qlora_oasst1_e3.py
│   │ │ └── mixtral_8x7b_instruct/
│   │ │   ├── mixtral_8x7b_instruct_full_oasst1_e3.py
│   │ │   └── mixtral_8x7b_instruct_qlora_oasst1_e3.py
│   │ ├── orpo/
│   │ │ ├── internlm/
│   │ │ │ ├── internlm2_chat_1_8b_orpo_full.py
│   │ │ │ ├── internlm2_chat_1_8b_orpo_full_varlenattn.py
│   │ │ │ ├── internlm2_chat_1_8b_orpo_full_varlenattn_jsonl_dataset.py
│   │ │ │ └── internlm2_chat_7b_orpo_qlora_varlenattn_ultrafeedback_e5.py
│   │ │ └── llama/
│   │ │   └── llama3_8b_instruct_orpo_qlora_varlenattn_ultrafeedback_e5.py
│   │ ├── phi/
│   │ │ └── phi3/
│   │ │   ├── phi3_mini_128k_instruct_full_alpaca_e3.py
│   │ │   ├── phi3_mini_128k_instruct_qlora_alpaca_e3.py
│   │ │   ├── phi3_mini_4k_instruct_full_alpaca_e3.py
│   │ │   └── phi3_mini_4k_instruct_qlora_alpaca_e3.py
│   │ ├── qwen/
│   │ │ ├── qwen1/
│   │ │ │ ├── qwen_1_8b/
│   │ │ │ │ ├── qwen_1_8b_qlora_alpaca_e3.py
│   │ │ │ │ ├── qwen_1_8b_qlora_alpaca_enzh_e3.py
│   │ │ │ │ ├── qwen_1_8b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ │ ├── qwen_1_8b_qlora_alpaca_zh_e3.py
│   │ │ │ │ └── qwen_1_8b_qlora_code_alpaca_e3.py
│   │ │ │ ├── qwen_1_8b_chat/
│   │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_e3.py
│   │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_enzh_e3.py
│   │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_zh_e3.py
│   │ │ │ │ └── qwen_1_8b_chat_qlora_code_alpaca_e3.py
│   │ │ │ ├── qwen_72b/
│   │ │ │ │ ├── qwen_72b_qlora_alpaca_e3.py
│   │ │ │ │ ├── qwen_72b_qlora_alpaca_enzh_e3.py
│   │ │ │ │ ├── qwen_72b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ │ ├── qwen_72b_qlora_alpaca_zh_e3.py
│   │ │ │ │ └── qwen_72b_qlora_code_alpaca_e3.py
│   │ │ │ ├── qwen_7b/
│   │ │ │ │ ├── qwen_7b_qlora_alpaca_e3.py
│   │ │ │ │ ├── qwen_7b_qlora_alpaca_enzh_e3.py
│   │ │ │ │ ├── qwen_7b_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │ │ ├── qwen_7b_qlora_alpaca_zh_e3.py
│   │ │ │ │ ├── qwen_7b_qlora_arxiv_gentitle_e3.py
│   │ │ │ │ ├── qwen_7b_qlora_code_alpaca_e3.py
│   │ │ │ │ ├── qwen_7b_qlora_colorist_e5.py
│   │ │ │ │ ├── qwen_7b_qlora_lawyer_e3.py
│   │ │ │ │ ├── qwen_7b_qlora_medical_e1.py
│   │ │ │ │ ├── qwen_7b_qlora_moss_sft_all_e1.py
│   │ │ │ │ ├── qwen_7b_qlora_moss_sft_all_e2_gpu8.py
│   │ │ │ │ ├── qwen_7b_qlora_moss_sft_plugins_e1.py
│   │ │ │ │ ├── qwen_7b_qlora_oasst1_512_e3.py
│   │ │ │ │ ├── qwen_7b_qlora_oasst1_e3.py
│   │ │ │ │ ├── qwen_7b_qlora_open_platypus_e3.py
│   │ │ │ │ ├── qwen_7b_qlora_openorca_e1.py
│   │ │ │ │ ├── qwen_7b_qlora_sql_e3.py
│   │ │ │ │ └── qwen_7b_qlora_tiny_codes_e1.py
│   │ │ │ └── qwen_7b_chat/
│   │ │ │   ├── qwen_7b_chat_qlora_alpaca_e3.py
│   │ │ │   ├── qwen_7b_chat_qlora_alpaca_enzh_e3.py
│   │ │ │   ├── qwen_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
│   │ │ │   ├── qwen_7b_chat_qlora_alpaca_zh_e3.py
│   │ │ │   ├── qwen_7b_chat_qlora_arxiv_gentitle_e3.py
│   │ │ │   ├── qwen_7b_chat_qlora_code_alpaca_e3.py
│   │ │ │   ├── qwen_7b_chat_qlora_colorist_e5.py
│   │ │ │   ├── qwen_7b_chat_qlora_lawyer_e3.py
│   │ │ │   ├── qwen_7b_chat_qlora_medical_e1.py
│   │ │ │   ├── qwen_7b_chat_qlora_oasst1_512_e3.py
│   │ │ │   ├── qwen_7b_chat_qlora_oasst1_e3.py
│   │ │ │   ├── qwen_7b_chat_qlora_open_platypus_e3.py
│   │ │ │   ├── qwen_7b_chat_qlora_openorca_e1.py
│   │ │ │   ├── qwen_7b_chat_qlora_sql_e3.py
│   │ │ │   └── qwen_7b_chat_qlora_tiny_codes_e1.py
│   │ │ └── qwen1_5/
│   │ │   ├── qwen1_5_0_5b/
│   │ │   │ ├── qwen1_5_0_5b_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_0_5b_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_0_5b_chat/
│   │ │   │ ├── qwen1_5_0_5b_chat_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_0_5b_chat_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_110b/
│   │ │   │ ├── qwen1_5_110b_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_110b_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_110b_chat/
│   │ │   │ ├── README.md
│   │ │   │ ├── qwen1_5_110b_chat_full_alpaca_e3.py
│   │ │   │ ├── qwen1_5_110b_chat_qlora_alpaca_e3.py
│   │ │   │ └── qwen1_5_110b_chat_qlora_alpaca_e3_16k_2gpus.py
│   │ │   ├── qwen1_5_14b/
│   │ │   │ ├── qwen1_5_14b_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_14b_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_14b_chat/
│   │ │   │ ├── qwen1_5_14b_chat_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_14b_chat_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_1_8b/
│   │ │   │ ├── qwen1_5_1_8b_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_1_8b_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_1_8b_chat/
│   │ │   │ ├── qwen1_5_1_8b_chat_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_1_8b_chat_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_4b/
│   │ │   │ ├── qwen1_5_4b_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_4b_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_4b_chat/
│   │ │   │ ├── qwen1_5_4b_chat_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_4b_chat_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_72b/
│   │ │   │ ├── qwen1_5_72b_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_72b_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_72b_chat/
│   │ │   │ ├── qwen1_5_72b_chat_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_72b_chat_qlora_alpaca_e3.py
│   │ │   ├── qwen1_5_7b/
│   │ │   │ ├── qwen1_5_7b_full_alpaca_e3.py
│   │ │   │ └── qwen1_5_7b_qlora_alpaca_e3.py
│   │ │   └── qwen1_5_7b_chat/
│   │ │     ├── qwen1_5_7b_chat_full_alpaca_e3.py
│   │ │     └── qwen1_5_7b_chat_qlora_alpaca_e3.py
│   │ ├── qwen_moe/
│   │ │ └── qwen1_5/
│   │ │   └── qwen1_5_moe_a2_7_b_chat/
│   │ │     └── qwen1_5_moe_a2_7_b_chat_full_alpaca_e3.py
│   │ ├── reward_model/
│   │ │ ├── internlm/
│   │ │ │ ├── internlm2_chat_1_8b_reward_full_ultrafeedback.py
│   │ │ │ ├── internlm2_chat_1_8b_reward_full_varlenattn_jsonl_dataset.py
│   │ │ │ ├── internlm2_chat_1_8b_reward_full_varlenattn_ultrafeedback.py
│   │ │ │ └── internlm2_chat_1_8b_reward_qlora_varlenattn_ultrafeedback.py
│   │ │ └── llama/
│   │ │   └── llama3_8b_instruct_reward_full_varlenattn_ultrafeedback.py
│   │ ├── starcoder/
│   │ │ └── starcoder_qlora_stack_exchange_example.py
│   │ ├── yi/
│   │ │ ├── yi_34b/
│   │ │ │ └── yi_34b_qlora_alpaca_enzh_e3.py
│   │ │ └── yi_6b/
│   │ │   └── yi_6b_qlora_alpaca_enzh_e3.py
│   │ └── zephyr/
│   │   └── zephyr_7b_beta_qlora_alpaca_e3.py
│   ├── dataset/
│   │ ├── __init__.py
│   │ ├── collate_fns/
│   │ │ ├── __init__.py
│   │ │ ├── default_collate_fn.py
│   │ │ ├── mmlu_collate_fn.py
│   │ │ └── preference_collate_fn.py
│   │ ├── concat_dataset.py
│   │ ├── huggingface.py
│   │ ├── intern_repo.py
│   │ ├── json_dataset.py
│   │ ├── llava.py
│   │ ├── map_fns/
│   │ │ ├── __init__.py
│   │ │ ├── dataset_map_fns/
│   │ │ │ ├── __init__.py
│   │ │ │ ├── alpaca_map_fn.py
│   │ │ │ ├── alpaca_zh_map_fn.py
│   │ │ │ ├── arxiv_map_fn.py
│   │ │ │ ├── code_alpaca_map_fn.py
│   │ │ │ ├── colors_map_fn.py
│   │ │ │ ├── crime_kg_assitant_map_fn.py
│   │ │ │ ├── default_map_fn.py
│   │ │ │ ├── law_reference_map_fn.py
│   │ │ │ ├── llava_map_fn.py
│   │ │ │ ├── medical_map_fn.py
│   │ │ │ ├── msagent_map_fn.py
│   │ │ │ ├── oasst1_map_fn.py
│   │ │ │ ├── openai_map_fn.py
│   │ │ │ ├── openorca_map_fn.py
│   │ │ │ ├── pretrain_map_fn.py
│   │ │ │ ├── sql_map_fn.py
│   │ │ │ ├── stack_exchange_map_fn.py
│   │ │ │ ├── tiny_codes_map_fn.py
│   │ │ │ └── wizardlm_map_fn.py
│   │ │ └── template_map_fn.py
│   │ ├── modelscope.py
│   │ ├── moss_sft.py
│   │ ├── preference_dataset.py
│   │ ├── refcoco_json.py
│   │ ├── samplers/
│   │ │ ├── __init__.py
│   │ │ ├── intern_repo.py
│   │ │ └── length_grouped.py
│   │ └── utils.py
│   ├── engine/
│   │ ├── __init__.py
│   │ ├── _strategy/
│   │ │ ├── __init__.py
│   │ │ └── deepspeed.py
│   │ ├── hooks/
│   │ │ ├── __init__.py
│   │ │ ├── dataset_info_hook.py
│   │ │ ├── evaluate_chat_hook.py
│   │ │ ├── hf_checkpoint_hook.py
│   │ │ ├── throughput_hook.py
│   │ │ └── varlen_attn_args_to_messagehub_hook.py
│   │ └── runner/
│   │   ├── __init__.py
│   │   └── loops.py
│   ├── entry_point.py
│   ├── evaluation/
│   │ ├── __init__.py
│   │ └── metrics/
│   │   ├── __init__.py
│   │   ├── mmlu_metric.py
│   │   └── reward_metric.py
│   ├── model/
│   │ ├── __init__.py
│   │ ├── dpo.py
│   │ ├── llava.py
│   │ ├── modules/
│   │ │ ├── __init__.py
│   │ │ ├── dispatch/
│   │ │ │ ├── __init__.py
│   │ │ │ ├── attention.py
│   │ │ │ ├── baichuan.py
│   │ │ │ ├── cohere.py
│   │ │ │ ├── deepseek_v2.py
│   │ │ │ ├── internlm.py
│   │ │ │ ├── internlm2.py
│   │ │ │ ├── llama.py
│   │ │ │ ├── mistral.py
│   │ │ │ ├── phi3.py
│   │ │ │ ├── qwen2.py
│   │ │ │ ├── triton_kernels/
│   │ │ │ │ ├── __init__.py
│   │ │ │ │ ├── layer_norm.py
│   │ │ │ │ ├── rms_norm.py
│   │ │ │ │ └── rotary.py
│   │ │ │ ├── utils.py
│   │ │ │ └── yi.py
│   │ │ └── projector/
│   │ │   ├── __init__.py
│   │ │   ├── configuration_projector.py
│   │ │   └── modeling_projector.py
│   │ ├── orpo.py
│   │ ├── reward.py
│   │ ├── sft.py
│   │ ├── transformers_models/
│   │ │ ├── __init__.py
│   │ │ ├── deepseek_v2/
│   │ │ │ ├── __init__.py
│   │ │ │ ├── configuration_deepseek.py
│   │ │ │ ├── modeling_deepseek.py
│   │ │ │ └── tokenization_deepseek_fast.py
│   │ │ └── mixtral/
│   │ │   ├── __init__.py
│   │ │   ├── configuration_mixtral.py
│   │ │   └── modeling_mixtral.py
│   │ └── utils.py
│   ├── parallel/
│   │ ├── __init__.py
│   │ └── sequence/
│   │   ├── __init__.py
│   │   ├── attention.py
│   │   ├── comm.py
│   │   ├── data_collate.py
│   │   ├── reduce_loss.py
│   │   ├── sampler.py
│   │   └── setup_distributed.py
│   ├── registry.py
│   ├── tools/
│   │ ├── chat.py
│   │ ├── check_custom_dataset.py
│   │ ├── copy_cfg.py
│   │ ├── data_preprocess/
│   │ │ ├── arxiv.py
│   │ │ └── convert_refcoco.py
│   │ ├── eval_refcoco.py
│   │ ├── get_data_order.py
│   │ ├── list_cfg.py
│   │ ├── list_dataset_format.py
│   │ ├── log_dataset.py
│   │ ├── mmbench.py
│   │ ├── model_converters/
│   │ │ ├── merge.py
│   │ │ ├── modeling_internlm2_reward/
│   │ │ │ ├── __init__.py
│   │ │ │ ├── configuration_internlm2.py
│   │ │ │ └── modeling_internlm2.py
│   │ │ ├── pth_to_hf.py
│   │ │ └── split.py
│   │ ├── plugins/
│   │ │ ├── __init__.py
│   │ │ ├── api.py
│   │ │ ├── calculate.py
│   │ │ ├── search.py
│   │ │ └── solve.py
│   │ ├── process_untokenized_datasets.py
│   │ ├── process_untokenized_datasets_legacy.py
│   │ ├── process_untokenized_llava_data.py
│   │ ├── test.py
│   │ ├── tokenize_ftdp_datasets.py
│   │ ├── train.py
│   │ └── utils.py
│   ├── utils/
│   │ ├── __init__.py
│   │ ├── constants.py
│   │ ├── fileio.py
│   │ ├── handle_moe_load_and_save.py
│   │ ├── stop_criteria.py
│   │ ├── templates.py
│   │ └── zero_to_any_dtype.py
│   └── version.py
└── xtuner-train_internvideo2_5/
  ├── .gitignore
  ├── .owners.yml
  ├── .pre-commit-config-zh-cn.yaml
  ├── .pre-commit-config.yaml
  ├── LICENSE
  ├── MANIFEST.in
  ├── README.md
  ├── data/
  │ ├── annotaions/
  │ │ └── ft_data_example.jsonl
  │ └── diy_ft_data.json
  ├── ft_internvideo_2_5.sh
  ├── ft_internvideo_2_5_datapacking.sh
  ├── requirements/
  │ ├── deepspeed.txt
  │ ├── docs.txt
  │ ├── modelscope.txt
  │ └── runtime.txt
  ├── requirements.txt
  ├── setup.cfg
  ├── setup.py
  ├── unify_internvl2_train_r16.py
  └── xtuner/
    ├── __init__.py
    ├── _lite/
    │ ├── __init__.py
    │ ├── accelerate/
    │ │ ├── __init__.py
    │ │ ├── dispatches/
    │ │ │ ├── __init__.py
    │ │ │ ├── _attention.py
    │ │ │ ├── _fused/
    │ │ │ │ ├── __init__.py
    │ │ │ │ ├── layer_norm.py
    │ │ │ │ ├── rms_norm.py
    │ │ │ │ └── rotary.py
    │ │ │ ├── clip.py
    │ │ │ ├── internlm2.py
    │ │ │ ├── internvl2.py
    │ │ │ ├── llama3.py
    │ │ │ ├── new.py
    │ │ │ ├── phi3.py
    │ │ │ ├── qwen2.py
    │ │ │ └── qwen_vl2.py
    │ │ ├── fsdp/
    │ │ │ ├── __init__.py
    │ │ │ ├── checkpointing.py
    │ │ │ ├── clip_grad.py
    │ │ │ ├── lazy.py
    │ │ │ ├── precision.py
    │ │ │ └── wrap.py
    │ │ ├── generate.py
    │ │ ├── lora.py
    │ │ └── packed.py
    │ ├── auto.py
    │ ├── chat/
    │ │ ├── __init__.py
    │ │ ├── backends/
    │ │ │ └── __init__.py
    │ │ ├── messages/
    │ │ │ ├── __init__.py
    │ │ │ ├── base.py
    │ │ │ └── chat.py
    │ │ └── templates/
    │ │   ├── __init__.py
    │ │   ├── chat.py
    │ │   └── hybrid.py
    │ ├── checkpoint.py
    │ ├── datasets/
    │ │ ├── __init__.py
    │ │ ├── dataset_fn.py
    │ │ ├── format.py
    │ │ ├── llava.py
    │ │ ├── load.py
    │ │ ├── load_new.py
    │ │ ├── text.py
    │ │ └── tokenize.py
    │ ├── internvl/
    │ │ ├── __init__.py
    │ │ ├── constants.py
    │ │ ├── conversation.py
    │ │ ├── dataset.py
    │ │ ├── new_dataset.py
    │ │ ├── v1_5/
    │ │ │ ├── configuration_intern_vit.py
    │ │ │ ├── configuration_internvl_chat.py
    │ │ │ ├── configuration_phi3.py
    │ │ │ ├── conversation.py
    │ │ │ ├── modeling_intern_vit.py
    │ │ │ ├── modeling_internvl_chat.py
    │ │ │ └── modeling_phi3.py
    │ │ └── video_utils.py
    │ ├── modelings/
    │ │ ├── __init__.py
    │ │ ├── internlm2/
    │ │ │ ├── __init__.py
    │ │ │ ├── configuration_internlm2.py
    │ │ │ └── modeling_internlm2.py
    │ │ └── model_fn.py
    │ ├── parallel/
    │ │ ├── __init__.py
    │ │ ├── comm.py
    │ │ ├── logger.py
    │ │ ├── new_setup.py
    │ │ ├── plans/
    │ │ │ └── internlm2.py
    │ │ ├── sampler.py
    │ │ ├── sequence/
    │ │ │ ├── __init__.py
    │ │ │ ├── attention.py
    │ │ │ ├── data_collate.py
    │ │ │ ├── ops.py
    │ │ │ └── reduce_loss.py
    │ │ └── setup.py
    │ └── yunchang/
    │   ├── __init__.py
    │   ├── comm/
    │   │ ├── __init__.py
    │   │ ├── all_to_all.py
    │   │ └── extract_local.py
    │   ├── globals.py
    │   ├── hybrid/
    │   │ ├── __init__.py
    │   │ ├── async_attn_layer.py
    │   │ ├── attn_layer.py
    │   │ └── utils.py
    │   ├── ring/
    │   │ ├── __init__.py
    │   │ ├── llama3_flash_attn_varlen.py
    │   │ ├── ring_flash_attn.py
    │   │ ├── ring_flash_attn_varlen.py
    │   │ ├── stripe_flash_attn.py
    │   │ ├── triton_utils.py
    │   │ ├── utils.py
    │   │ ├── zigzag_ring_flash_attn.py
    │   │ └── zigzag_ring_flash_attn_varlen.py
    │   └── ulysses/
    │     ├── __init__.py
    │     └── attn_layer.py
    ├── apis/
    │ ├── __init__.py
    │ ├── datasets/
    │ │ ├── __init__.py
    │ │ ├── alpaca.py
    │ │ ├── arxiv.py
    │ │ ├── code_alpaca.py
    │ │ ├── colorist.py
    │ │ ├── lawyer.py
    │ │ ├── medical.py
    │ │ ├── moss_003_sft.py
    │ │ ├── oasst1.py
    │ │ ├── open_orca.py
    │ │ ├── sql.py
    │ │ ├── tiny_codes.py
    │ │ └── wizardlm.py
    │ ├── model.py
    │ └── training_args.py
    ├── configs/
    │ ├── __init__.py
    │ ├── baichuan/
    │ │ ├── baichuan2_13b_base/
    │ │ │ ├── baichuan2_13b_base_qlora_alpaca_e3.py
    │ │ │ ├── baichuan2_13b_base_qlora_alpaca_enzh_e3.py
    │ │ │ ├── baichuan2_13b_base_qlora_alpaca_enzh_oasst1_e3.py
    │ │ │ ├── baichuan2_13b_base_qlora_alpaca_zh_e3.py
    │ │ │ ├── baichuan2_13b_base_qlora_arxiv_gentitle_e3.py
    │ │ │ ├── baichuan2_13b_base_qlora_code_alpaca_e3.py
    │ │ │ ├── baichuan2_13b_base_qlora_colorist_e5.py
    │ │ │ ├── baichuan2_13b_base_qlora_lawyer_e3.py
    │ │ │ ├── baichuan2_13b_base_qlora_oasst1_512_e3.py
    │ │ │ ├── baichuan2_13b_base_qlora_oasst1_e3.py
    │ │ │ ├── baichuan2_13b_base_qlora_open_platypus_e3.py
    │ │ │ └── baichuan2_13b_base_qlora_sql_e3.py
    │ │ ├── baichuan2_13b_chat/
    │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_e3.py
    │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_enzh_e3.py
    │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_enzh_oasst1_e3.py
    │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_zh_e3.py
    │ │ │ ├── baichuan2_13b_chat_qlora_code_alpaca_e3.py
    │ │ │ ├── baichuan2_13b_chat_qlora_lawyer_e3.py
    │ │ │ ├── baichuan2_13b_chat_qlora_oasst1_512_e3.py
    │ │ │ ├── baichuan2_13b_chat_qlora_oasst1_e3.py
    │ │ │ └── baichuan2_13b_chat_qlora_open_platypus_e3.py
    │ │ ├── baichuan2_7b_base/
    │ │ │ ├── baichuan2_7b_base_qlora_alpaca_e3.py
    │ │ │ ├── baichuan2_7b_base_qlora_alpaca_enzh_e3.py
    │ │ │ ├── baichuan2_7b_base_qlora_alpaca_enzh_oasst1_e3.py
    │ │ │ ├── baichuan2_7b_base_qlora_alpaca_zh_e3.py
    │ │ │ ├── baichuan2_7b_base_qlora_arxiv_gentitle_e3.py
    │ │ │ ├── baichuan2_7b_base_qlora_code_alpaca_e3.py
    │ │ │ ├── baichuan2_7b_base_qlora_colorist_e5.py
    │ │ │ ├── baichuan2_7b_base_qlora_lawyer_e3.py
    │ │ │ ├── baichuan2_7b_base_qlora_oasst1_512_e3.py
    │ │ │ ├── baichuan2_7b_base_qlora_oasst1_e3.py
    │ │ │ ├── baichuan2_7b_base_qlora_open_platypus_e3.py
    │ │ │ └── baichuan2_7b_base_qlora_sql_e3.py
    │ │ ├── baichuan2_7b_chat/
    │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_e3.py
    │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_enzh_e3.py
    │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
    │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_zh_e3.py
    │ │ │ ├── baichuan2_7b_chat_qlora_code_alpaca_e3.py
    │ │ │ ├── baichuan2_7b_chat_qlora_lawyer_e3.py
    │ │ │ ├── baichuan2_7b_chat_qlora_oasst1_512_e3.py
    │ │ │ ├── baichuan2_7b_chat_qlora_oasst1_e3.py
    │ │ │ └── baichuan2_7b_chat_qlora_open_platypus_e3.py
    │ │ ├── baichuan_13b_base/
    │ │ │ ├── baichuan_13b_base_qlora_alpaca_e3.py
    │ │ │ ├── baichuan_13b_base_qlora_alpaca_enzh_e3.py
    │ │ │ ├── baichuan_13b_base_qlora_alpaca_enzh_oasst1_e3.py
    │ │ │ ├── baichuan_13b_base_qlora_alpaca_zh_e3.py
    │ │ │ ├── baichuan_13b_base_qlora_arxiv_gentitle_e3.py
    │ │ │ ├── baichuan_13b_base_qlora_code_alpaca_e3.py
    │ │ │ ├── baichuan_13b_base_qlora_colorist_e5.py
    │ │ │ ├── baichuan_13b_base_qlora_lawyer_e3.py
    │ │ │ ├── baichuan_13b_base_qlora_medical_e1.py
    │ │ │ ├── baichuan_13b_base_qlora_moss_sft_all_e1.py
    │ │ │ ├── baichuan_13b_base_qlora_moss_sft_all_e2_gpu8.py
    │ │ │ ├── baichuan_13b_base_qlora_moss_sft_plugins_e1.py
    │ │ │ ├── baichuan_13b_base_qlora_oasst1_512_e3.py
    │ │ │ ├── baichuan_13b_base_qlora_oasst1_e3.py
    │ │ │ ├── baichuan_13b_base_qlora_open_platypus_e3.py
    │ │ │ ├── baichuan_13b_base_qlora_openorca_e1.py
    │ │ │ ├── baichuan_13b_base_qlora_sql_e3.py
    │ │ │ └── baichuan_13b_base_qlora_tiny_codes_e1.py
    │ │ ├── baichuan_13b_chat/
    │ │ │ ├── baichuan_13b_chat_qlora_alpaca_e3.py
    │ │ │ ├── baichuan_13b_chat_qlora_alpaca_enzh_e3.py
    │ │ │ ├── baichuan_13b_chat_qlora_alpaca_enzh_oasst1_e3.py
    │ │ │ ├── baichuan_13b_chat_qlora_alpaca_zh_e3.py
    │ │ │ ├── baichuan_13b_chat_qlora_arxiv_gentitle_e3.py
    │ │ │ ├── baichuan_13b_chat_qlora_code_alpaca_e3.py
    │ │ │ ├── baichuan_13b_chat_qlora_colorist_e5.py
    │ │ │ ├── baichuan_13b_chat_qlora_lawyer_e3.py
    │ │ │ ├── baichuan_13b_chat_qlora_medical_e1.py
    │ │ │ ├── baichuan_13b_chat_qlora_oasst1_512_e3.py
    │ │ │ ├── baichuan_13b_chat_qlora_oasst1_e3.py
    │ │ │ ├── baichuan_13b_chat_qlora_open_platypus_e3.py
    │ │ │ ├── baichuan_13b_chat_qlora_openorca_e1.py
    │ │ │ ├── baichuan_13b_chat_qlora_sql_e3.py
    │ │ │ └── baichuan_13b_chat_qlora_tiny_codes_e1.py
    │ │ └── baichuan_7b/
    │ │   ├── baichuan_7b_qlora_alpaca_e3.py
    │ │   ├── baichuan_7b_qlora_alpaca_enzh_e3.py
    │ │   ├── baichuan_7b_qlora_alpaca_enzh_oasst1_e3.py
    │ │   ├── baichuan_7b_qlora_alpaca_zh_e3.py
    │ │   ├── baichuan_7b_qlora_arxiv_gentitle_e3.py
    │ │   ├── baichuan_7b_qlora_code_alpaca_e3.py
    │ │   ├── baichuan_7b_qlora_colorist_e5.py
    │ │   ├── baichuan_7b_qlora_lawyer_e3.py
    │ │   ├── baichuan_7b_qlora_medical_e1.py
    │ │   ├── baichuan_7b_qlora_moss_sft_all_e1.py
    │ │   ├── baichuan_7b_qlora_moss_sft_all_e2_gpu8.py
    │ │   ├── baichuan_7b_qlora_moss_sft_plugins_e1.py
    │ │   ├── baichuan_7b_qlora_oasst1_512_e3.py
    │ │   ├── baichuan_7b_qlora_oasst1_e3.py
    │ │   ├── baichuan_7b_qlora_open_platypus_e3.py
    │ │   ├── baichuan_7b_qlora_openorca_e1.py
    │ │   ├── baichuan_7b_qlora_sql_e3.py
    │ │   └── baichuan_7b_qlora_tiny_codes_e1.py
    │ ├── chatglm/
    │ │ ├── chatglm2_6b/
    │ │ │ ├── chatglm2_6b_qlora_alpaca_e3.py
    │ │ │ ├── chatglm2_6b_qlora_alpaca_enzh_e3.py
    │ │ │ ├── chatglm2_6b_qlora_alpaca_enzh_oasst1_e3.py
    │ │ │ ├── chatglm2_6b_qlora_alpaca_zh_e3.py
    │ │ │ ├── chatglm2_6b_qlora_arxiv_gentitle_e3.py
    │ │ │ ├── chatglm2_6b_qlora_code_alpaca_e3.py
    │ │ │ ├── chatglm2_6b_qlora_colorist_e5.py
    │ │ │ ├── chatglm2_6b_qlora_lawyer_e3.py
    │ │ │ ├── chatglm2_6b_qlora_medical_e1.py
    │ │ │ ├── chatglm2_6b_qlora_oasst1_512_e3.py
    │ │ │ ├── chatglm2_6b_qlora_oasst1_e3.py
    │ │ │ ├── chatglm2_6b_qlora_open_platypus_e3.py
    │ │ │ ├── chatglm2_6b_qlora_openorca_e1.py
    │ │ │ ├── chatglm2_6b_qlora_sql_e3.py
    │ │ │ └── chatglm2_6b_qlora_tiny_codes_e1.py
    │ │ ├── chatglm3_6b/
    │ │ │ ├── chatglm3_6b_qlora_alpaca_e3.py
    │ │ │ ├── chatglm3_6b_qlora_alpaca_enzh_e3.py
    │ │ │ ├── chatglm3_6b_qlora_alpaca_enzh_oasst1_e3.py
    │ │ │ ├── chatglm3_6b_qlora_alpaca_zh_e3.py
    │ │ │ ├── chatglm3_6b_qlora_arxiv_gentitle_e3.py
    │ │ │ ├── chatglm3_6b_qlora_code_alpaca_e3.py
    │ │ │ ├── chatglm3_6b_qlora_colorist_e5.py
    │ │ │ ├── chatglm3_6b_qlora_lawyer_e3.py
    │ │ │ ├── chatglm3_6b_qlora_medical_e1.py
    │ │ │ ├── chatglm3_6b_qlora_oasst1_512_e3.py
    │ │ │ ├── chatglm3_6b_qlora_oasst1_e3.py
    │ │ │ ├── chatglm3_6b_qlora_open_platypus_e3.py
    │ │ │ ├── chatglm3_6b_qlora_openorca_e1.py
    │ │ │ ├── chatglm3_6b_qlora_sql_e3.py
    │ │ │ └── chatglm3_6b_qlora_tiny_codes_e1.py
    │ │ └── chatglm3_6b_base/
    │ │   ├── chatglm3_6b_base_qlora_alpaca_e3.py
    │ │   ├── chatglm3_6b_base_qlora_alpaca_enzh_e3.py
    │ │   ├── chatglm3_6b_base_qlora_alpaca_enzh_oasst1_e3.py
    │ │   ├── chatglm3_6b_base_qlora_alpaca_zh_e3.py
    │ │   ├── chatglm3_6b_base_qlora_arxiv_gentitle_e3.py
    │ │   ├── chatglm3_6b_base_qlora_code_alpaca_e3.py
    │ │   ├── chatglm3_6b_base_qlora_colorist_e5.py
    │ │   ├── chatglm3_6b_base_qlora_lawyer_e3.py
    │ │   ├── chatglm3_6b_base_qlora_medical_e1.py
    │ │   ├── chatglm3_6b_base_qlora_oasst1_512_e3.py
    │ │   ├── chatglm3_6b_base_qlora_oasst1_e3.py
    │ │   ├── chatglm3_6b_base_qlora_open_platypus_e3.py
    │ │   ├── chatglm3_6b_base_qlora_openorca_e1.py
    │ │   ├── chatglm3_6b_base_qlora_sql_e3.py
    │ │   └── chatglm3_6b_base_qlora_tiny_codes_e1.py
    │ ├── cohere/
    │ │ ├── README.md
    │ │ └── cohere_104b/
    │ │   └── cohere_100b_128k_sp32.py
    │ ├── custom_dataset/
    │ │ ├── pretrain/
    │ │ │ ├── baichuan/
    │ │ │ │ ├── baichuan2_13b_base_full_custom_pretrain_e1.py
    │ │ │ │ └── baichuan2_7b_base_full_custom_pretrain_e1.py
    │ │ │ ├── chatglm/
    │ │ │ │ ├── chatglm2_6b_full_custom_pretrain_e1.py
    │ │ │ │ └── chatglm3_6b_full_custom_pretrain_e1.py
    │ │ │ ├── deepseek/
    │ │ │ │ └── deepseek_moe_16b_base_full_custom_pretrain_e1.py
    │ │ │ ├── gemma/
    │ │ │ │ ├── gemma_2b_full_custom_pretrain_e1.py
    │ │ │ │ └── gemma_7b_full_custom_pretrain_e1.py
    │ │ │ ├── internlm/
    │ │ │ │ ├── internlm2_1_8b_full_custom_pretrain_e1.py
    │ │ │ │ ├── internlm2_20b_full_custom_pretrain_e1.py
internlm2_7b_full_custom_pretrain_e1.py │ │ │ ├── llama/ │ │ │ │ ├── llama2_70b_full_custom_pretrain_e1.py │ │ │ │ └── llama2_7b_full_custom_pretrain_e1.py │ │ │ ├── mistral/ │ │ │ │ └── mistral_7b_full_custom_pretrain_e1.py │ │ │ ├── mixtral/ │ │ │ │ └── mixtral_8x7b_full_custom_pretrain_e1.py │ │ │ ├── qwen/ │ │ │ │ ├── qwen1_5_0_5b_full_custom_pretrain_e1.py │ │ │ │ ├── qwen1_5_14b_full_custom_pretrain_e1.py │ │ │ │ ├── qwen1_5_1_8b_full_custom_pretrain_e1.py │ │ │ │ ├── qwen1_5_4b_full_custom_pretrain_e1.py │ │ │ │ ├── qwen1_5_72b_full_custom_pretrain_e1.py │ │ │ │ ├── qwen1_5_7b_full_custom_pretrain_e1.py │ │ │ │ ├── qwen_1_8b_full_custom_pretrain_e1.py │ │ │ │ ├── qwen_72b_full_custom_pretrain_e1.py │ │ │ │ └── qwen_7b_full_custom_pretrain_e1.py │ │ │ ├── starcoder/ │ │ │ │ └── starcoder_full_custom_pretrain_e1.py │ │ │ ├── yi/ │ │ │ │ ├── yi_34b_full_custom_pretrain_e1.py │ │ │ │ └── yi_6b_full_custom_pretrain_e1.py │ │ │ └── zephyr/ │ │ │ └── zephyr_7b_beta_full_custom_pretrain_e1.py │ │ └── sft/ │ │ ├── baichuan/ │ │ │ ├── baichuan2_13b_chat_qlora_custom_sft_e1.py │ │ │ ├── baichuan2_7b_chat_qlora_custom_sft_e1.py │ │ │ ├── baichuan_13b_chat_qlora_custom_sft_e1.py │ │ │ └── baichuan_7b_qlora_custom_sft_e1.py │ │ ├── chatglm/ │ │ │ ├── chatglm2_6b_qlora_custom_sft_e1.py │ │ │ └── chatglm3_6b_qlora_custom_sft_e1.py │ │ ├── deepseek/ │ │ │ ├── deepseek_moe_16b_chat_qlora_custom_sft_e1.py │ │ │ └── deepseekcoder_6_7b_instruct_qlora_custom_sft_e1.py │ │ ├── gemma/ │ │ │ ├── gemma_2b_it_qlora_custom_sft_e1.py │ │ │ ├── gemma_2b_qlora_custom_sft_e1.py │ │ │ ├── gemma_7b_it_qlora_custom_sft_e1.py │ │ │ └── gemma_7b_qlora_custom_sft_e1.py │ │ ├── internlm/ │ │ │ ├── internlm2_chat_1_8b_qlora_custom_sft_e1.py │ │ │ ├── internlm2_chat_20b_qlora_custom_sft_e1.py │ │ │ └── internlm2_chat_7b_qlora_custom_sft_e1.py │ │ ├── llama/ │ │ │ ├── llama2_70b_qlora_custom_sft_e1.py │ │ │ └── llama2_7b_chat_qlora_custom_sft_e1.py │ │ ├── mistral/ │ │ │ └── mistral_7b_full_finetune_custom_sft_e1.py │ │ ├── mixtral/ │ │ │ └── mixtral_8x7b_instruct_qlora_custom_sft_e1.py │ │ ├── qwen/ │ │ │ ├── qwen1_5_0_5b_chat_qlora_custom_sft_e1.py │ │ │ ├── qwen1_5_14b_chat_qlora_custom_sft_e1.py │ │ │ ├── qwen1_5_1_8b_chat_qlora_custom_sft_e1.py │ │ │ ├── qwen1_5_4b_chat_qlora_custom_sft_e1.py │ │ │ ├── qwen1_5_72b_chat_qlora_custom_sft_e1.py │ │ │ ├── qwen1_5_7b_chat_qlora_custom_sft_e1.py │ │ │ ├── qwen_1_8b_chat_qlora_custom_sft_e1.py │ │ │ ├── qwen_72b_qlora_custom_sft_e1.py │ │ │ └── qwen_7b_chat_qlora_custom_sft_e1.py │ │ ├── starcoder/ │ │ │ └── starcoder_qlora_custom_sft_e1.py │ │ ├── yi/ │ │ │ ├── yi_34b_qlora_custom_sft_e1.py │ │ │ └── yi_6b_qlora_custom_sft_e1.py │ │ └── zephyr/ │ │ └── zephyr_7b_beta_qlora_custom_sft_e1.py │ ├── deepseek/ │ │ ├── README.md │ │ ├── deepseek_coder_6_7b_base/ │ │ │ └── deepseek_coder_6_7b_base_qlora_code_alpaca_e3.py │ │ ├── deepseek_coder_6_7b_instruct/ │ │ │ └── deepseekcoder_6_7b_instruct_qlora_code_alpaca_e3.py │ │ ├── deepseek_moe_16b_base/ │ │ │ ├── deepseek_moe_16b_base_full_oasst1_e3.py │ │ │ └── deepseek_moe_16b_base_qlora_oasst1_e3.py │ │ ├── deepseek_moe_16b_chat/ │ │ │ ├── deepseek_moe_16b_chat_full_oasst1_e3.py │ │ │ └── deepseek_moe_16b_chat_qlora_oasst1_e3.py │ │ ├── deepseek_v2_chat/ │ │ │ └── deepseek_v2_chat_full_alpaca_e3.py │ │ └── deepseek_v2_lite_chat/ │ │ ├── deepseek_v2_lite_chat_full_alpaca_e3.py │ │ └── deepseek_v2_lite_chat_full_alpaca_e3_32k_varlen.py │ ├── deepspeed/ │ │ ├── deepspeed_zero1.json │ │ ├── deepspeed_zero2.json │ │ ├── 
deepspeed_zero2_offload.json │ │ ├── deepspeed_zero3.json │ │ └── deepspeed_zero3_offload.json │ ├── dpo/ │ │ ├── internlm/ │ │ │ ├── internlm2_chat_1_8b_dpo_full.py │ │ │ ├── internlm2_chat_1_8b_dpo_full_varlenattn.py │ │ │ ├── internlm2_chat_1_8b_dpo_full_varlenattn_jsonl_dataset.py │ │ │ └── internlm2_chat_7b_dpo_qlora_varlenattn.py │ │ └── llama/ │ │ └── llama3_8b_instruct_dpo_qlora_varlenattn.py │ ├── gemma/ │ │ ├── gemma_2b/ │ │ │ ├── gemma_2b_full_alpaca_e3.py │ │ │ └── gemma_2b_qlora_alpaca_e3.py │ │ ├── gemma_2b_it/ │ │ │ ├── gemma_2b_it_full_alpaca_e3.py │ │ │ └── gemma_2b_it_qlora_alpaca_e3.py │ │ ├── gemma_7b/ │ │ │ ├── gemma_7b_full_alpaca_e3.py │ │ │ └── gemma_7b_qlora_alpaca_e3.py │ │ └── gemma_7b_it/ │ │ ├── gemma_7b_it_full_alpaca_e3.py │ │ └── gemma_7b_it_qlora_alpaca_e3.py │ ├── internlm/ │ │ ├── internlm2_1_8b/ │ │ │ ├── internlm2_1_8b_full_alpaca_e3.py │ │ │ └── internlm2_1_8b_qlora_alpaca_e3.py │ │ ├── internlm2_20b/ │ │ │ ├── internlm2_20b_full_finetune_custom_dataset_e1.py │ │ │ ├── internlm2_20b_qlora_alpaca_e3.py │ │ │ ├── internlm2_20b_qlora_arxiv_gentitle_e3.py │ │ │ ├── internlm2_20b_qlora_code_alpaca_e3.py │ │ │ ├── internlm2_20b_qlora_colorist_e5.py │ │ │ ├── internlm2_20b_qlora_lawyer_e3.py │ │ │ ├── internlm2_20b_qlora_msagent_react_e3_gpu8.py │ │ │ ├── internlm2_20b_qlora_oasst1_512_e3.py │ │ │ ├── internlm2_20b_qlora_oasst1_e3.py │ │ │ └── internlm2_20b_qlora_sql_e3.py │ │ ├── internlm2_7b/ │ │ │ ├── internlm2_7b_full_finetune_custom_dataset_e1.py │ │ │ ├── internlm2_7b_full_finetune_custom_dataset_e1_sequence_parallel_4.py │ │ │ ├── internlm2_7b_qlora_alpaca_e3.py │ │ │ ├── internlm2_7b_qlora_arxiv_gentitle_e3.py │ │ │ ├── internlm2_7b_qlora_code_alpaca_e3.py │ │ │ ├── internlm2_7b_qlora_colorist_e5.py │ │ │ ├── internlm2_7b_qlora_json_e3.py │ │ │ ├── internlm2_7b_qlora_lawyer_e3.py │ │ │ ├── internlm2_7b_qlora_msagent_react_e3_gpu8.py │ │ │ ├── internlm2_7b_qlora_oasst1_512_e3.py │ │ │ ├── internlm2_7b_qlora_oasst1_e3.py │ │ │ ├── internlm2_7b_qlora_sql_e3.py │ │ │ ├── internlm2_7b_w_internevo_dataset.py │ │ │ ├── internlm2_7b_w_tokenized_dataset.py │ │ │ └── internlm2_7b_w_untokenized_dataset.py │ │ ├── internlm2_chat_1_8b/ │ │ │ ├── internlm2_chat_1_8b_full_alpaca_e3.py │ │ │ └── internlm2_chat_1_8b_qlora_alpaca_e3.py │ │ ├── internlm2_chat_20b/ │ │ │ ├── internlm2_chat_20b_full_finetune_custom_dataset_e1.py │ │ │ ├── internlm2_chat_20b_qlora_alpaca_e3.py │ │ │ ├── internlm2_chat_20b_qlora_code_alpaca_e3.py │ │ │ ├── internlm2_chat_20b_qlora_lawyer_e3.py │ │ │ ├── internlm2_chat_20b_qlora_oasst1_512_e3.py │ │ │ └── internlm2_chat_20b_qlora_oasst1_e3.py │ │ ├── internlm2_chat_7b/ │ │ │ ├── internlm2_chat_7b_full_finetune_custom_dataset_e1.py │ │ │ ├── internlm2_chat_7b_qlora_alpaca_e3.py │ │ │ ├── internlm2_chat_7b_qlora_code_alpaca_e3.py │ │ │ ├── internlm2_chat_7b_qlora_lawyer_e3.py │ │ │ ├── internlm2_chat_7b_qlora_oasst1_512_e3.py │ │ │ └── internlm2_chat_7b_qlora_oasst1_e3.py │ │ ├── internlm_20b/ │ │ │ ├── internlm_20b_qlora_alpaca_e3.py │ │ │ ├── internlm_20b_qlora_alpaca_enzh_e3.py │ │ │ ├── internlm_20b_qlora_alpaca_enzh_oasst1_e3.py │ │ │ ├── internlm_20b_qlora_alpaca_zh_e3.py │ │ │ ├── internlm_20b_qlora_arxiv_gentitle_e3.py │ │ │ ├── internlm_20b_qlora_code_alpaca_e3.py │ │ │ ├── internlm_20b_qlora_colorist_e5.py │ │ │ ├── internlm_20b_qlora_lawyer_e3.py │ │ │ ├── internlm_20b_qlora_msagent_react_e3_gpu8.py │ │ │ ├── internlm_20b_qlora_oasst1_512_e3.py │ │ │ ├── internlm_20b_qlora_oasst1_e3.py │ │ │ ├── 
internlm_20b_qlora_open_platypus_e3.py │ │ │ └── internlm_20b_qlora_sql_e3.py │ │ ├── internlm_7b/ │ │ │ ├── internlm_7b_full_alpaca_e3.py │ │ │ ├── internlm_7b_full_alpaca_enzh_e3.py │ │ │ ├── internlm_7b_full_alpaca_enzh_oasst1_e3.py │ │ │ ├── internlm_7b_full_alpaca_zh_e3.py │ │ │ ├── internlm_7b_full_intern_repo_dataset_template.py │ │ │ ├── internlm_7b_full_oasst1_e3.py │ │ │ ├── internlm_7b_qlora_alpaca_e3.py │ │ │ ├── internlm_7b_qlora_alpaca_enzh_e3.py │ │ │ ├── internlm_7b_qlora_alpaca_enzh_oasst1_e3.py │ │ │ ├── internlm_7b_qlora_alpaca_zh_e3.py │ │ │ ├── internlm_7b_qlora_arxiv_gentitle_e3.py │ │ │ ├── internlm_7b_qlora_code_alpaca_e3.py │ │ │ ├── internlm_7b_qlora_colorist_e5.py │ │ │ ├── internlm_7b_qlora_json_e3.py │ │ │ ├── internlm_7b_qlora_lawyer_e3.py │ │ │ ├── internlm_7b_qlora_medical_e1.py │ │ │ ├── internlm_7b_qlora_moss_sft_all_e1.py │ │ │ ├── internlm_7b_qlora_moss_sft_all_e2_gpu8.py │ │ │ ├── internlm_7b_qlora_moss_sft_plugins_e1.py │ │ │ ├── internlm_7b_qlora_msagent_react_e3_gpu8.py │ │ │ ├── internlm_7b_qlora_oasst1_512_e3.py │ │ │ ├── internlm_7b_qlora_oasst1_e3.py │ │ │ ├── internlm_7b_qlora_oasst1_e3_hf.py │ │ │ ├── internlm_7b_qlora_oasst1_mmlu_e3.py │ │ │ ├── internlm_7b_qlora_open_platypus_e3.py │ │ │ ├── internlm_7b_qlora_openorca_e1.py │ │ │ ├── internlm_7b_qlora_sql_e3.py │ │ │ └── internlm_7b_qlora_tiny_codes_e1.py │ │ ├── internlm_chat_20b/ │ │ │ ├── internlm_chat_20b_qlora_alpaca_e3.py │ │ │ ├── internlm_chat_20b_qlora_alpaca_enzh_e3.py │ │ │ ├── internlm_chat_20b_qlora_alpaca_enzh_oasst1_e3.py │ │ │ ├── internlm_chat_20b_qlora_alpaca_zh_e3.py │ │ │ ├── internlm_chat_20b_qlora_code_alpaca_e3.py │ │ │ ├── internlm_chat_20b_qlora_lawyer_e3.py │ │ │ ├── internlm_chat_20b_qlora_oasst1_512_e3.py │ │ │ ├── internlm_chat_20b_qlora_oasst1_e3.py │ │ │ └── internlm_chat_20b_qlora_open_platypus_e3.py │ │ └── internlm_chat_7b/ │ │ ├── internlm_chat_7b_qlora_alpaca_e3.py │ │ ├── internlm_chat_7b_qlora_alpaca_enzh_e3.py │ │ ├── internlm_chat_7b_qlora_alpaca_enzh_oasst1_e3.py │ │ ├── internlm_chat_7b_qlora_alpaca_zh_e3.py │ │ ├── internlm_chat_7b_qlora_arxiv_gentitle_e3.py │ │ ├── internlm_chat_7b_qlora_code_alpaca_e3.py │ │ ├── internlm_chat_7b_qlora_colorist_e5.py │ │ ├── internlm_chat_7b_qlora_lawyer_e3.py │ │ ├── internlm_chat_7b_qlora_medical_e1.py │ │ ├── internlm_chat_7b_qlora_oasst1_512_e3.py │ │ ├── internlm_chat_7b_qlora_oasst1_e3.py │ │ ├── internlm_chat_7b_qlora_open_platypus_e3.py │ │ ├── internlm_chat_7b_qlora_openorca_e1.py │ │ ├── internlm_chat_7b_qlora_sql_e3.py │ │ └── internlm_chat_7b_qlora_tiny_codes_e1.py │ ├── llama/ │ │ ├── llama2_70b/ │ │ │ ├── llama2_70b_full_wizardlm_e1.py │ │ │ ├── llama2_70b_int8_lora_open_platypus_e1.py │ │ │ ├── llama2_70b_int8_lora_open_platypus_e1_hf.py │ │ │ ├── llama2_70b_qlora_open_platypus_e1.py │ │ │ └── llama2_70b_qlora_open_platypus_e1_hf.py │ │ ├── llama2_7b/ │ │ │ ├── llama2_7b_full_pgbooks_400iters_sp1.py │ │ │ ├── llama2_7b_full_pgbooks_400iters_sp4.py │ │ │ ├── llama2_7b_full_wizardlm_e1.py │ │ │ ├── llama2_7b_qlora_alpaca_e3.py │ │ │ ├── llama2_7b_qlora_alpaca_enzh_e3.py │ │ │ ├── llama2_7b_qlora_alpaca_enzh_oasst1_e3.py │ │ │ ├── llama2_7b_qlora_alpaca_zh_e3.py │ │ │ ├── llama2_7b_qlora_arxiv_gentitle_e3.py │ │ │ ├── llama2_7b_qlora_code_alpaca_e3.py │ │ │ ├── llama2_7b_qlora_colorist_e5.py │ │ │ ├── llama2_7b_qlora_lawyer_e3.py │ │ │ ├── llama2_7b_qlora_medical_e1.py │ │ │ ├── llama2_7b_qlora_moss_sft_all_e1.py │ │ │ ├── llama2_7b_qlora_moss_sft_all_e2_gpu8.py │ │ │ ├── 
llama2_7b_qlora_moss_sft_plugins_e1.py │ │ │ ├── llama2_7b_qlora_msagent_react_e3_gpu8.py │ │ │ ├── llama2_7b_qlora_oasst1_512_e3.py │ │ │ ├── llama2_7b_qlora_oasst1_e3.py │ │ │ ├── llama2_7b_qlora_open_platypus_e3.py │ │ │ ├── llama2_7b_qlora_openorca_e1.py │ │ │ ├── llama2_7b_qlora_sql_e3.py │ │ │ └── llama2_7b_qlora_tiny_codes_e1.py │ │ ├── llama2_7b_chat/ │ │ │ ├── llama2_7b_chat_qlora_alpaca_e3.py │ │ │ ├── llama2_7b_chat_qlora_alpaca_enzh_e3.py │ │ │ ├── llama2_7b_chat_qlora_alpaca_enzh_oasst1_e3.py │ │ │ ├── llama2_7b_chat_qlora_alpaca_zh_e3.py │ │ │ ├── llama2_7b_chat_qlora_arxiv_gentitle_e3.py │ │ │ ├── llama2_7b_chat_qlora_code_alpaca_e3.py │ │ │ ├── llama2_7b_chat_qlora_colorist_e5.py │ │ │ ├── llama2_7b_chat_qlora_lawyer_e3.py │ │ │ ├── llama2_7b_chat_qlora_medical_e1.py │ │ │ ├── llama2_7b_chat_qlora_oasst1_512_e3.py │ │ │ ├── llama2_7b_chat_qlora_oasst1_e3.py │ │ │ ├── llama2_7b_chat_qlora_open_platypus_e3.py │ │ │ ├── llama2_7b_chat_qlora_openorca_e1.py │ │ │ ├── llama2_7b_chat_qlora_sql_e3.py │ │ │ └── llama2_7b_chat_qlora_tiny_codes_e1.py │ │ ├── llama3_70b_instruct/ │ │ │ └── llama3_70b_instruct_qlora_alpaca_e3_2k_gpu8.py │ │ ├── llama3_8b/ │ │ │ ├── README.md │ │ │ └── llama3_8b_full_alpaca_e3.py │ │ ├── llama3_8b_instruct/ │ │ │ ├── llama3_8b_instruct_full_alpaca_e3.py │ │ │ └── llama3_8b_instruct_qlora_alpaca_e3.py │ │ └── llama_7b/ │ │ ├── llama_7b_qlora_alpaca_e3.py │ │ ├── llama_7b_qlora_alpaca_enzh_e3.py │ │ ├── llama_7b_qlora_alpaca_enzh_oasst1_e3.py │ │ ├── llama_7b_qlora_alpaca_zh_e3.py │ │ ├── llama_7b_qlora_arxiv_gentitle_e3.py │ │ ├── llama_7b_qlora_code_alpaca_e3.py │ │ ├── llama_7b_qlora_colorist_e5.py │ │ ├── llama_7b_qlora_lawyer_e3.py │ │ ├── llama_7b_qlora_medical_e1.py │ │ ├── llama_7b_qlora_moss_sft_all_e1.py │ │ ├── llama_7b_qlora_moss_sft_all_e2_gpu8.py │ │ ├── llama_7b_qlora_moss_sft_plugins_e1.py │ │ ├── llama_7b_qlora_oasst1_512_e3.py │ │ ├── llama_7b_qlora_oasst1_e3.py │ │ ├── llama_7b_qlora_open_platypus_e3.py │ │ ├── llama_7b_qlora_openorca_e1.py │ │ ├── llama_7b_qlora_sql_e3.py │ │ └── llama_7b_qlora_tiny_codes_e1.py │ ├── llama_speed_benchmark/ │ │ ├── llama2_70b/ │ │ │ ├── llama2_70b_full_alpaca_enzh_128k_sp8.py │ │ │ ├── llama2_70b_full_alpaca_enzh_256k_sp16.py │ │ │ ├── llama2_70b_full_alpaca_enzh_32k_sp4.py │ │ │ └── llama2_70b_full_alpaca_enzh_8k_sp1.py │ │ ├── llama2_7b/ │ │ │ ├── llama2_7b_full_alpaca_enzh_128k_sp8.py │ │ │ ├── llama2_7b_full_alpaca_enzh_1M_sp16.py │ │ │ ├── llama2_7b_full_alpaca_enzh_256k_sp8.py │ │ │ ├── llama2_7b_full_alpaca_enzh_32k_sp1.py │ │ │ └── llama2_7b_full_alpaca_enzh_8k_sp1.py │ │ └── yi_34b/ │ │ ├── yi_34b_200k_full_alpaca_enzh_128k_sp8.py │ │ ├── yi_34b_200k_full_alpaca_enzh_256k_sp8.py │ │ ├── yi_34b_200k_full_alpaca_enzh_32k_sp2.py │ │ └── yi_34b_200k_full_alpaca_enzh_8k_sp1.py │ ├── llava/ │ │ ├── README.md │ │ ├── README_zh-CN.md │ │ ├── internlm2_chat_1_8b_clip_vit_large_p14_336/ │ │ │ ├── finetune/ │ │ │ │ └── llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py │ │ │ └── pretrain/ │ │ │ └── llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain.py │ │ ├── internlm2_chat_20b_clip_vit_large_p14_336/ │ │ │ ├── finetune/ │ │ │ │ ├── llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_finetune.py │ │ │ │ └── llava_internlm2_chat_20b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py │ │ │ └── pretrain/ │ │ │ └── llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_pretrain.py │ │ ├── internlm2_chat_7b_clip_vit_large_p14_336/ │ │ │ ├── finetune/ │ │ │ │ 
├── llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_finetune.py │ │ │ │ └── llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py │ │ │ └── pretrain/ │ │ │ └── llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py │ │ ├── internlm_chat_7b_clip_vit_large_p14_336/ │ │ │ ├── finetune/ │ │ │ │ └── llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py │ │ │ └── pretrain/ │ │ │ └── llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py │ │ ├── llama3_70b_instruct_clip_vit_large_p14_336/ │ │ │ └── pretrain/ │ │ │ └── llava_llama3_70b_instruct_quant_clip_vit_large_p14_336_e1_gpu8_pretrain.py │ │ ├── llama3_8b_instruct_clip_vit_large_p14_336/ │ │ │ ├── README.md │ │ │ ├── convert_xtuner_weights_to_hf.py │ │ │ ├── convert_xtuner_weights_to_llava.py │ │ │ ├── finetune/ │ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_e1_gpu8_finetune.py │ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py │ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py │ │ │ │ └── llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune.py │ │ │ └── pretrain/ │ │ │ ├── llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py │ │ │ ├── llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py │ │ │ └── llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain.py │ │ ├── official/ │ │ │ ├── llava_v15_13b/ │ │ │ │ ├── llava_v15_13b_finetune.py │ │ │ │ ├── llava_v15_13b_finetune_lora.py │ │ │ │ └── llava_v15_13b_pretrain.py │ │ │ └── llava_v15_7b/ │ │ │ ├── llava_v15_7b_finetune.py │ │ │ ├── llava_v15_7b_finetune_lora.py │ │ │ └── llava_v15_7b_pretrain.py │ │ ├── phi3_mini_4k_instruct_clip_vit_large_p14_336/ │ │ │ ├── README.md │ │ │ ├── convert_phi_to_llama.py │ │ │ ├── convert_xtuner_weights_to_hf.py │ │ │ ├── convert_xtuner_weights_to_llava.py │ │ │ ├── finetune/ │ │ │ │ ├── llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_e1_gpu8_finetune.py │ │ │ │ └── llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_full_e2_gpu8_internvl_finetune.py │ │ │ └── pretrain/ │ │ │ ├── llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py │ │ │ └── llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py │ │ ├── vicuna_13b_v15_clip_vit_large_p14_336/ │ │ │ ├── finetune/ │ │ │ │ └── llava_vicuna_13b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py │ │ │ └── pretrain/ │ │ │ └── llava_vicuna_13b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py │ │ └── vicuna_7b_v15_clip_vit_large_p14_336/ │ │ ├── finetune/ │ │ │ ├── llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py │ │ │ └── llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_refcoco.py │ │ └── pretrain/ │ │ └── llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py │ ├── mistral/ │ │ ├── mistral_7b_full_finetune_custom_dataset_e1.py │ │ ├── mistral_7b_qlora_skypile_pretrain_e1.py │ │ ├── mistral_7b_w_tokenized_dataset.py │ │ └── mistral_7b_w_untokenized_dataset.py │ ├── mixtral/ │ │ ├── README.md │ │ ├── mixtral_8x7b/ │ │ │ ├── mixtral_8x7b_full_oasst1_e3.py │ │ │ └── mixtral_8x7b_qlora_oasst1_e3.py │ │ └── mixtral_8x7b_instruct/ │ │ ├── mixtral_8x7b_instruct_full_oasst1_e3.py │ │ └── mixtral_8x7b_instruct_qlora_oasst1_e3.py │ ├── orpo/ │ │ ├── internlm/ │ │ │ ├── internlm2_chat_1_8b_orpo_full.py │ │ │ ├── internlm2_chat_1_8b_orpo_full_varlenattn.py │ │ │ ├── 
internlm2_chat_1_8b_orpo_full_varlenattn_jsonl_dataset.py │ │ │ └── internlm2_chat_7b_orpo_qlora_varlenattn_ultrafeedback_e5.py │ │ └── llama/ │ │ └── llama3_8b_instruct_orpo_qlora_varlenattn_ultrafeedback_e5.py │ ├── phi/ │ │ └── phi3/ │ │ ├── phi3_mini_128k_instruct_full_alpaca_e3.py │ │ ├── phi3_mini_128k_instruct_qlora_alpaca_e3.py │ │ ├── phi3_mini_4k_instruct_full_alpaca_e3.py │ │ └── phi3_mini_4k_instruct_qlora_alpaca_e3.py │ ├── qwen/ │ │ ├── qwen1/ │ │ │ ├── qwen_1_8b/ │ │ │ │ ├── qwen_1_8b_qlora_alpaca_e3.py │ │ │ │ ├── qwen_1_8b_qlora_alpaca_enzh_e3.py │ │ │ │ ├── qwen_1_8b_qlora_alpaca_enzh_oasst1_e3.py │ │ │ │ ├── qwen_1_8b_qlora_alpaca_zh_e3.py │ │ │ │ └── qwen_1_8b_qlora_code_alpaca_e3.py │ │ │ ├── qwen_1_8b_chat/ │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_e3.py │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_enzh_e3.py │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_enzh_oasst1_e3.py │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_zh_e3.py │ │ │ │ └── qwen_1_8b_chat_qlora_code_alpaca_e3.py │ │ │ ├── qwen_72b/ │ │ │ │ ├── qwen_72b_qlora_alpaca_e3.py │ │ │ │ ├── qwen_72b_qlora_alpaca_enzh_e3.py │ │ │ │ ├── qwen_72b_qlora_alpaca_enzh_oasst1_e3.py │ │ │ │ ├── qwen_72b_qlora_alpaca_zh_e3.py │ │ │ │ └── qwen_72b_qlora_code_alpaca_e3.py │ │ │ ├── qwen_7b/ │ │ │ │ ├── qwen_7b_qlora_alpaca_e3.py │ │ │ │ ├── qwen_7b_qlora_alpaca_enzh_e3.py │ │ │ │ ├── qwen_7b_qlora_alpaca_enzh_oasst1_e3.py │ │ │ │ ├── qwen_7b_qlora_alpaca_zh_e3.py │ │ │ │ ├── qwen_7b_qlora_arxiv_gentitle_e3.py │ │ │ │ ├── qwen_7b_qlora_code_alpaca_e3.py │ │ │ │ ├── qwen_7b_qlora_colorist_e5.py │ │ │ │ ├── qwen_7b_qlora_lawyer_e3.py │ │ │ │ ├── qwen_7b_qlora_medical_e1.py │ │ │ │ ├── qwen_7b_qlora_moss_sft_all_e1.py │ │ │ │ ├── qwen_7b_qlora_moss_sft_all_e2_gpu8.py │ │ │ │ ├── qwen_7b_qlora_moss_sft_plugins_e1.py │ │ │ │ ├── qwen_7b_qlora_oasst1_512_e3.py │ │ │ │ ├── qwen_7b_qlora_oasst1_e3.py │ │ │ │ ├── qwen_7b_qlora_open_platypus_e3.py │ │ │ │ ├── qwen_7b_qlora_openorca_e1.py │ │ │ │ ├── qwen_7b_qlora_sql_e3.py │ │ │ │ └── qwen_7b_qlora_tiny_codes_e1.py │ │ │ └── qwen_7b_chat/ │ │ │ ├── qwen_7b_chat_qlora_alpaca_e3.py │ │ │ ├── qwen_7b_chat_qlora_alpaca_enzh_e3.py │ │ │ ├── qwen_7b_chat_qlora_alpaca_enzh_oasst1_e3.py │ │ │ ├── qwen_7b_chat_qlora_alpaca_zh_e3.py │ │ │ ├── qwen_7b_chat_qlora_arxiv_gentitle_e3.py │ │ │ ├── qwen_7b_chat_qlora_code_alpaca_e3.py │ │ │ ├── qwen_7b_chat_qlora_colorist_e5.py │ │ │ ├── qwen_7b_chat_qlora_lawyer_e3.py │ │ │ ├── qwen_7b_chat_qlora_medical_e1.py │ │ │ ├── qwen_7b_chat_qlora_oasst1_512_e3.py │ │ │ ├── qwen_7b_chat_qlora_oasst1_e3.py │ │ │ ├── qwen_7b_chat_qlora_open_platypus_e3.py │ │ │ ├── qwen_7b_chat_qlora_openorca_e1.py │ │ │ ├── qwen_7b_chat_qlora_sql_e3.py │ │ │ └── qwen_7b_chat_qlora_tiny_codes_e1.py │ │ └── qwen1_5/ │ │ ├── qwen1_5_0_5b/ │ │ │ ├── qwen1_5_0_5b_full_alpaca_e3.py │ │ │ └── qwen1_5_0_5b_qlora_alpaca_e3.py │ │ ├── qwen1_5_0_5b_chat/ │ │ │ ├── qwen1_5_0_5b_chat_full_alpaca_e3.py │ │ │ └── qwen1_5_0_5b_chat_qlora_alpaca_e3.py │ │ ├── qwen1_5_110b/ │ │ │ ├── qwen1_5_110b_full_alpaca_e3.py │ │ │ └── qwen1_5_110b_qlora_alpaca_e3.py │ │ ├── qwen1_5_110b_chat/ │ │ │ ├── README.md │ │ │ ├── qwen1_5_110b_chat_full_alpaca_e3.py │ │ │ ├── qwen1_5_110b_chat_qlora_alpaca_e3.py │ │ │ └── qwen1_5_110b_chat_qlora_alpaca_e3_16k_2gpus.py │ │ ├── qwen1_5_14b/ │ │ │ ├── qwen1_5_14b_full_alpaca_e3.py │ │ │ └── qwen1_5_14b_qlora_alpaca_e3.py │ │ ├── qwen1_5_14b_chat/ │ │ │ ├── qwen1_5_14b_chat_full_alpaca_e3.py │ │ │ └── qwen1_5_14b_chat_qlora_alpaca_e3.py │ │ ├── qwen1_5_1_8b/ │ │ │ ├── 
qwen1_5_1_8b_full_alpaca_e3.py │ │ │ └── qwen1_5_1_8b_qlora_alpaca_e3.py │ │ ├── qwen1_5_1_8b_chat/ │ │ │ ├── qwen1_5_1_8b_chat_full_alpaca_e3.py │ │ │ └── qwen1_5_1_8b_chat_qlora_alpaca_e3.py │ │ ├── qwen1_5_4b/ │ │ │ ├── qwen1_5_4b_full_alpaca_e3.py │ │ │ └── qwen1_5_4b_qlora_alpaca_e3.py │ │ ├── qwen1_5_4b_chat/ │ │ │ ├── qwen1_5_4b_chat_full_alpaca_e3.py │ │ │ └── qwen1_5_4b_chat_qlora_alpaca_e3.py │ │ ├── qwen1_5_72b/ │ │ │ ├── qwen1_5_72b_full_alpaca_e3.py │ │ │ └── qwen1_5_72b_qlora_alpaca_e3.py │ │ ├── qwen1_5_72b_chat/ │ │ │ ├── qwen1_5_72b_chat_full_alpaca_e3.py │ │ │ └── qwen1_5_72b_chat_qlora_alpaca_e3.py │ │ ├── qwen1_5_7b/ │ │ │ ├── qwen1_5_7b_full_alpaca_e3.py │ │ │ └── qwen1_5_7b_qlora_alpaca_e3.py │ │ └── qwen1_5_7b_chat/ │ │ ├── qwen1_5_7b_chat_full_alpaca_e3.py │ │ └── qwen1_5_7b_chat_qlora_alpaca_e3.py │ ├── qwen_moe/ │ │ └── qwen1_5/ │ │ └── qwen1_5_moe_a2_7_b_chat/ │ │ └── qwen1_5_moe_a2_7_b_chat_full_alpaca_e3.py │ ├── reward_model/ │ │ ├── internlm/ │ │ │ ├── internlm2_chat_1_8b_reward_full_ultrafeedback.py │ │ │ ├── internlm2_chat_1_8b_reward_full_varlenattn_jsonl_dataset.py │ │ │ ├── internlm2_chat_1_8b_reward_full_varlenattn_ultrafeedback.py │ │ │ └── internlm2_chat_1_8b_reward_qlora_varlenattn_ultrafeedback.py │ │ └── llama/ │ │ └── llama3_8b_instruct_reward_full_varlenattn_ultrafeedback.py │ ├── starcoder/ │ │ └── starcoder_qlora_stack_exchange_example.py │ ├── yi/ │ │ ├── yi_34b/ │ │ │ └── yi_34b_qlora_alpaca_enzh_e3.py │ │ └── yi_6b/ │ │ └── yi_6b_qlora_alpaca_enzh_e3.py │ └── zephyr/ │ └── zephyr_7b_beta_qlora_alpaca_e3.py ├── dataset/ │ ├── __init__.py │ ├── collate_fns/ │ │ ├── __init__.py │ │ ├── default_collate_fn.py │ │ ├── mmlu_collate_fn.py │ │ └── preference_collate_fn.py │ ├── concat_dataset.py │ ├── huggingface.py │ ├── intern_repo.py │ ├── json_dataset.py │ ├── llava.py │ ├── map_fns/ │ │ ├── __init__.py │ │ ├── dataset_map_fns/ │ │ │ ├── __init__.py │ │ │ ├── alpaca_map_fn.py │ │ │ ├── alpaca_zh_map_fn.py │ │ │ ├── arxiv_map_fn.py │ │ │ ├── code_alpaca_map_fn.py │ │ │ ├── colors_map_fn.py │ │ │ ├── crime_kg_assitant_map_fn.py │ │ │ ├── default_map_fn.py │ │ │ ├── law_reference_map_fn.py │ │ │ ├── llava_map_fn.py │ │ │ ├── medical_map_fn.py │ │ │ ├── msagent_map_fn.py │ │ │ ├── oasst1_map_fn.py │ │ │ ├── openai_map_fn.py │ │ │ ├── openorca_map_fn.py │ │ │ ├── pretrain_map_fn.py │ │ │ ├── sql_map_fn.py │ │ │ ├── stack_exchange_map_fn.py │ │ │ ├── tiny_codes_map_fn.py │ │ │ └── wizardlm_map_fn.py │ │ └── template_map_fn.py │ ├── modelscope.py │ ├── moss_sft.py │ ├── preference_dataset.py │ ├── refcoco_json.py │ ├── samplers/ │ │ ├── __init__.py │ │ ├── intern_repo.py │ │ └── length_grouped.py │ └── utils.py ├── engine/ │ ├── __init__.py │ ├── _strategy/ │ │ ├── __init__.py │ │ └── deepspeed.py │ ├── hooks/ │ │ ├── __init__.py │ │ ├── dataset_info_hook.py │ │ ├── evaluate_chat_hook.py │ │ ├── hf_checkpoint_hook.py │ │ ├── throughput_hook.py │ │ └── varlen_attn_args_to_messagehub_hook.py │ └── runner/ │ ├── __init__.py │ └── loops.py ├── entry_point.py ├── evaluation/ │ ├── __init__.py │ └── metrics/ │ ├── __init__.py │ ├── mmlu_metric.py │ └── reward_metric.py ├── model/ │ ├── __init__.py │ ├── dpo.py │ ├── llava.py │ ├── modules/ │ │ ├── __init__.py │ │ ├── dispatch/ │ │ │ ├── __init__.py │ │ │ ├── attention.py │ │ │ ├── baichuan.py │ │ │ ├── cohere.py │ │ │ ├── deepseek_v2.py │ │ │ ├── internlm.py │ │ │ ├── internlm2.py │ │ │ ├── llama.py │ │ │ ├── mistral.py │ │ │ ├── phi3.py │ │ │ ├── qwen2.py │ │ │ ├── triton_kernels/ │ │ │ │ ├── __init__.py │ │ 
│ │ ├── layer_norm.py │ │ │ │ ├── rms_norm.py │ │ │ │ └── rotary.py │ │ │ ├── utils.py │ │ │ └── yi.py │ │ └── projector/ │ │ ├── __init__.py │ │ ├── configuration_projector.py │ │ └── modeling_projector.py │ ├── orpo.py │ ├── reward.py │ ├── sft.py │ ├── transformers_models/ │ │ ├── __init__.py │ │ ├── deepseek_v2/ │ │ │ ├── __init__.py │ │ │ ├── configuration_deepseek.py │ │ │ ├── modeling_deepseek.py │ │ │ └── tokenization_deepseek_fast.py │ │ └── mixtral/ │ │ ├── __init__.py │ │ ├── configuration_mixtral.py │ │ └── modeling_mixtral.py │ └── utils.py ├── parallel/ │ ├── __init__.py │ └── sequence/ │ ├── __init__.py │ ├── attention.py │ ├── comm.py │ ├── data_collate.py │ ├── reduce_loss.py │ ├── sampler.py │ └── setup_distributed.py ├── registry.py ├── tools/ │ ├── chat.py │ ├── check_custom_dataset.py │ ├── copy_cfg.py │ ├── data_preprocess/ │ │ ├── arxiv.py │ │ └── convert_refcoco.py │ ├── eval_refcoco.py │ ├── get_data_order.py │ ├── list_cfg.py │ ├── list_dataset_format.py │ ├── log_dataset.py │ ├── mmbench.py │ ├── model_converters/ │ │ ├── merge.py │ │ ├── modeling_internlm2_reward/ │ │ │ ├── __init__.py │ │ │ ├── configuration_internlm2.py │ │ │ └── modeling_internlm2.py │ │ ├── pth_to_hf.py │ │ └── split.py │ ├── plugins/ │ │ ├── __init__.py │ │ ├── api.py │ │ ├── calculate.py │ │ ├── search.py │ │ └── solve.py │ ├── process_untokenized_datasets.py │ ├── process_untokenized_datasets_legacy.py │ ├── process_untokenized_llava_data.py │ ├── test.py │ ├── tokenize_ftdp_datasets.py │ ├── train.py │ └── utils.py ├── utils/ │ ├── __init__.py │ ├── constants.py │ ├── fileio.py │ ├── handle_moe_load_and_save.py │ ├── stop_criteria.py │ ├── templates.py │ └── zero_to_any_dtype.py └── version.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitattributes ================================================ # Auto detect text files and perform LF normalization logs * text=auto ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2025 Yi Wang Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================

# VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

[Xinhao Li](https://scholar.google.com/citations?user=evR3uR0AAAAJ), [Yi Wang](https://scholar.google.com.hk/citations?user=Xm2M8UwAAAAJ), [Jiashuo Yu](https://scholar.google.com.hk/citations?user=iH0Aq0YAAAAJ&oi=ao), [Xiangyu Zeng](https://scholar.google.com/citations?user=jS13DXkAAAAJ&hl=zh-CN), Yuhan Zhu, Haian Huang, Jianfei Gao, [Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ), [Yinan He](https://dblp.org/pid/93/7763.html), Chenting Wang, [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl), [Yali Wang](https://scholar.google.com/citations?user=hD948dkAAAAJ), and [Limin Wang](https://scholar.google.com/citations?user=HEuN8PcAAAAJ)

🤗 Model & Data    |   🖥️ Demo    |    📑 Paper    |    🌐 Blog

## :fire: Updates

- [x] **2025/06/13**: 🎉🎉🎉 Our model achieves promising results on the [VideoEval-Pro](https://arxiv.org/abs/2505.14640) benchmark focused on long video understanding!
- [x] **2025/05/10**: 🔥🔥🔥 We release most videos of our [training data](https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data); we hope they are of help to you!
- [x] **2025/03/27**: 🔥🔥 We release our dataset and evaluation codes for single-hop and multi-hop needle-in-a-haystack!
- [x] **2025/03/09**: 🔥🔥 We release the weights of each training stage [here](https://github.com/OpenGVLab/VideoChat-Flash/blob/main/llava-train_videochat/README.md); try building your own VideoChat-Flash on them!
- [x] **2025/02/25**: 🔥🔥 We release our [training data](https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data), [LLaVA-based training code](llava-train_videochat) for VideoChat-Flash, and [XTuner-based training code](xtuner-train_internvideo2_5) for finetuning InternVideo2.5.
- [x] **2025/02/12**: 🎉🎉🎉 Our VideoChat-Flash-7B@448 achieved first place on the latest Video Detail Caption Benchmark, [AuroraCap](https://rese1f.github.io/aurora-web/).
- [x] **2025/01/15**: We provide [evaluation codes](lmms-eval_videochat) for QA & grounding benchmarks.
- [x] **2025/01/12**: 🔥🔥🔥 Release **VideoChat2-Flash**, a powerful MLLM built on a video encoder ([InternVideo](https://github.com/OpenGVLab/InternVideo)) and an LLM ([Qwen](https://github.com/QwenLM/Qwen)).
  - We offer five models: [VideoChat2-Flash-2B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448) (small LLM), [VideoChat2-Flash-7B@224](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res224), [VideoChat2-Flash-7B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448) (overall best), [VideoChat-Flash-Qwen2_5-7B-1M](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224) (super long video input), and [VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B) (stronger short-term temporal understanding).

## 📑 Future Plan

- [ ] lmdeploy/vllm support for VideoChat-Flash and InternVideo2.5
- [ ] LoRA finetuning code for VideoChat-Flash and InternVideo2.5
- [ ] Mixed image/video training code for InternVideo2.5
- [ ] Faster training code with XTuner for VideoChat-Flash

As I am currently very busy with work and find it difficult to complete the above plans quickly, I sincerely ask friends in the community to join in and **submit a PR**.

## :parrot: Introduction

**🚀 State-of-the-art performance** in short and long video understanding, with temporal localization capabilities comparable to expert models.

![alt text](img/sota.png)

**🔭 Supports ultra-long video inputs**, achieving a groundbreaking needle-in-a-haystack evaluation accuracy of **99.1% on 10,000 frames**, and capable of processing videos up to three hours long.

![alt text](img/niah.png)

**⚡ Highly efficient model architecture** with exceptional inference speed, encoding each video frame into just **16 tokens**, making it **5–10** times faster than the previous model.

![alt text](img/model_framework.png)
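To make that token budget concrete, here is a quick back-of-the-envelope sketch using only the 16-tokens-per-frame figure above; the frame counts are illustrative (64–512 matches the SFT range used in training, and 10,000 matches the needle-in-a-haystack setting):

```python
# Rough visual-token budget at 16 tokens per video frame.
TOKENS_PER_FRAME = 16

for num_frames in (64, 512, 10_000):
    visual_tokens = num_frames * TOKENS_PER_FRAME
    print(f"{num_frames:>6} frames -> {visual_tokens:>7,} visual tokens")

# Even 10,000 frames come to only 160,000 visual tokens, which is how
# hour-scale videos stay within a long-context LLM's window.
```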
## Demo & Inference

Refer to the [HF README](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448) to run inference with our model.

## Evaluation

See the [evaluation codes](lmms-eval_videochat). [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) also supports our model, so you can use it as well to evaluate our model on various benchmarks.

## Training

See the [LLaVA-based training code](llava-train_videochat) for VideoChat-Flash and the [XTuner-based training code](xtuner-train_internvideo2_5) for finetuning InternVideo2.5.

## :bar_chart: [NIAH](./BENCHMARK.md)

![alt text](img/mhniah.png)

See [xtuner-eval_niah](xtuner-eval_niah) for the evaluation of Single-Hop NIAH-Video and Multi-Hop NIAH-Video.

# :page_facing_up: Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{li2024videochat,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and Qiao, Yu and Wang, Yali and Wang, Limin},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```

# :dizzy: Acknowledgement

Thanks to the following open-source projects: [InternVideo](https://github.com/OpenGVLab/InternVideo), [UMT](https://github.com/OpenGVLab/unmasked_teacher), [Qwen](https://github.com/QwenLM/Qwen), [LLaVA-VL](https://github.com/LLaVA-VL/LLaVA-NeXT), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [Ask-Anything](https://github.com/OpenGVLab/Ask-Anything), [ToMe](https://github.com/facebookresearch/ToMe), [LongVLM](https://github.com/ziplab/LongVLM), [FastV](https://github.com/pkunlp-icler/FastV), [LLaVolta](https://github.com/Beckschen/LLaVolta), [PyramidDrop](https://github.com/Cooperx521/PyramidDrop), and [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA); their implementations provide valuable reference experience for our project.

================================================ FILE: llava-train_videochat/.dockerignore ================================================ # The .dockerignore file excludes files from the container build process. # # https://docs.docker.com/engine/reference/builder/#dockerignore-file # Exclude Git files .git .github .gitignore # Exclude Python cache files __pycache__ .mypy_cache .pytest_cache .ruff_cache # Exclude Python virtual environment /venv # Exclude some weights /openai /liuhaotian ================================================ FILE: llava-train_videochat/.editorconfig ================================================ root = true # Unix-style newlines with a newline ending every file [*] end_of_line = lf insert_final_newline = true trim_trailing_whitespace = true charset = utf-8 # 4 space indentation [*.{py,json}] indent_style = space indent_size = 4 # 2 space indentation [*.{md,sh,yaml,yml}] indent_style = space indent_size = 2 ================================================ FILE: llava-train_videochat/.gitattributes ================================================ # https://git-scm.com/docs/gitattributes # Set the default behavior, in case people don't have core.autocrlf set.
# https://git-scm.com/docs/gitattributes#_end_of_line_conversion * text=auto # common python attributes, taken from https://github.com/alexkaratarakis/gitattributes/blob/710900479a2bedeec7003d381719521ffbb18bf8/Python.gitattributes # Source files # ============ *.pxd text diff=python *.py text diff=python *.py3 text diff=python *.pyw text diff=python *.pyx text diff=python *.pyz text diff=python *.pyi text diff=python # Binary files # ============ *.db binary *.p binary *.pkl binary *.pickle binary *.pyc binary export-ignore *.pyo binary export-ignore *.pyd binary # Jupyter notebook *.ipynb text eol=lf ================================================ FILE: llava-train_videochat/.gitignore ================================================ # Python __pycache__ *.pyc *.egg-info dist # Log *.log *.log.* # *.json # *.jsonl # Data !**/alpaca-data-conversation.json # Editor .idea *.swp .vscode # Other .DS_Store wandb output llavavid checkpoints project_checkpoints debug_checkpoints playground/data playground/cc3m_llava34b_cap ckpts* .ipynb_checkpoints chunyl_scripts *.ipynb # DevContainer !.devcontainer/* # Demo serve_images/ notebooks/ logs scripts/dist_* logs/ submissions/ cn_scripts/ internal_project_checkpoints/ work_dirs scripts/i18n/* playground/.nfs028b000000010add00000001 HIP playground/.nfs028b0000017bff2c00000012 scripts/qwen scripts/vicuna scripts/mistral scripts/baseline_rep scripts/cn_boli01_hl scripts/cn_boli01_lf scripts/cn_lf scripts/cn_lq scripts/cn_yg scripts/cn_yg_hao scripts/eva_encoder scripts/i18n scripts/i18n_higher_res scripts/multi-images scratchpad build/ playground/*.json mlx_configs/ data_processing/ # demo/ ================================================ FILE: llava-train_videochat/LICENSE ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). 
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================ FILE: llava-train_videochat/README.md ================================================

# 👀 How to train and evaluate VideoChat-Flash? 🦜

## 1. Prepare Training Data

Our training data has been collected and used across different projects and by different people. For the data that has already been uploaded, we point you to the corresponding locations; please collect the relevant data fragments and integrate them in your own environment.

We use a data format similar to [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train). ***You can customize your own training data in this format.*** In [data](data), we provide the data used in each training stage, along with the corresponding annotation locations. We have made all the data annotations and some of the videos available on [OpenGVLab/VideoChat-Flash-Training-Data](https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data), and all video source URLs are listed in the annotation files.
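For orientation, here is a minimal sketch of what one video SFT sample could look like in this format; the file name, video path, and conversation are made up, and the field names follow the general LLaVA-NeXT convention rather than anything specific to this repo, so check the released annotation files for the exact fields each stage expects:

```python
import json

# A hypothetical video-SFT sample: a video path (resolved against the
# dataset's data_root) plus a human/gpt conversation with an "<image>"
# placeholder, following the LLaVA-NeXT annotation convention.
sample = {
    "id": "demo-0000",
    "video": "my_videos/clip_0000.mp4",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat happens in this video?"},
        {"from": "gpt", "value": "A person slices vegetables and adds them to a pan."},
    ],
}

with open("my_custom_sft.json", "w") as f:
    json.dump([sample], f, indent=2)
```

The resulting JSON file is then registered in a stage's data yaml (see `data/`) as a `json_path` entry together with a `data_root` and, optionally, a `sampling_strategy` such as `"first:25%"` or a `video_read_type`.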
## 2. Training

| Stage | Num. frames | ViT | Connector | LLM | CKPT |
|--------|:-------:|:------:|:------:|:------:|:------:|
| [stage1](scripts/train/stage1-init_connector) | 4 | :snowflake: | :fire: | :snowflake: | [all projector weights](https://huggingface.co/OpenGVLab/stage1-mm-projectors/tree/main) |
| [stage2](scripts/train/stage2-visual_pretraining) | 4-8 | :fire: | :fire: | :fire: | [UMT-Qwen2_7B](https://huggingface.co/OpenGVLab/stage2-UMT-Qwen2-7B-tome16_mlp), [UMT-Qwen2_5_1M_7B](https://huggingface.co/OpenGVLab/stage2-UMT-Qwen2_5_7B_1m-tome16_mlp), [UMT-HD-Qwen2_5_2B](https://huggingface.co/OpenGVLab/stage2-UMT-Qwen2_5_1.5B-tome16_mlp), [InternVideo2-Qwen2_5_7B](https://huggingface.co/OpenGVLab/stage2-InternVideo2-1B-Qwen2_5-7B-tome16_mlp) |
| [stage3](scripts/train/stage3-video_sft) | 64-512 | :fire: | :fire: | :fire: | [UMT-Qwen2_7B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448), [UMT-HD-Qwen2_5-2B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448), [UMT-Qwen2_5_1M_7B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224), [InternVideo2-Qwen2_5_7B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B) |
| [stage4](scripts/train/stage4_highres_postft) | 64-512 | :fire: | :fire: | :snowflake: | [UMT-HD-Qwen2-7B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448) |

Training time with 32 A100s:

- stage1: under one hour
- stage2: about 2 days
- stage3: about 2–3 days
- stage4: about 2–3 days

### Tips

- ***We recommend starting from stage 3 on top of our provided stage-2 model to save training cost, and you can use [1/4 of the stage-3 data](data/ablation_short-long_mix_sft.yaml) for ablations (as we do)! You can also skip stage 4 if you don't need absolute SoTA performance!***
- We use slurm to train models on multiple machines. **If you only have one machine or you don't use slurm**, please refer to [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/finetune_ov.sh) to modify the scripts.
- If you want to finetune [UMT-Qwen2_5_1M_7B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224), modify [`max_position_embeddings`](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224/blob/main/config.json#L185) to a smaller value such as 32768 to avoid CUDA OOM!

### Install

```bash
git clone https://github.com/OpenGVLab/VideoChat-Flash
cd VideoChat-Flash/llava-train_videochat
pip install -e .
```

### Stage-1: Video-Language Alignment

Please download the pretrained video encoders from [Hugging Face](https://huggingface.co/OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash) first. Then modify `ckpt_path` in `build_vit` of `llava/model/multimodal_encoder/umt_encoder.py` or `llava/model/multimodal_encoder/internvideo2_encoder.py`.

```bash
bash scripts/train/stage1-init_connector/stage1_umt_tome16_res224_qwen7b.sh
```

### Stage-2: Short Video Pre-training

```bash
bash scripts/train/stage2-visual_pretraining/stage2_umt_tome16_res224_qwen_7b.sh
```

### Stage-3: Joint Short & Long Video Instruction Tuning

```bash
bash scripts/train/stage3-video_sft/stage3_umt_tome16_res224_qwen_7b.sh
```

### Stage-4: Efficient High-Resolution Post-finetuning

Please modify `vision_tower="umt-hd-large"` in `Your_stage3_checkpoint_path/config.json` first!
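This switch, like the `max_position_embeddings` tip above, is a plain edit to the checkpoint's `config.json`. A minimal sketch (the checkpoint path is a placeholder):

```python
import json
from pathlib import Path

# Placeholder path: point this at your own stage-3 checkpoint directory.
cfg_path = Path("Your_stage3_checkpoint_path/config.json")
cfg = json.loads(cfg_path.read_text())

cfg["vision_tower"] = "umt-hd-large"      # switch to the high-res ViT for stage 4
# cfg["max_position_embeddings"] = 32768  # optional: shrink for the 1M model to avoid CUDA OOM

cfg_path.write_text(json.dumps(cfg, indent=2))
```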
```bash
bash scripts/train/stage4_highres_postft/stage4_umt_tome16_res448_qwen_7b.sh
```

## Evaluation

Overwrite your checkpoint directory with the configuration (JSON) and Python files from OpenGVLab/VideoChat-Flash, and then you can use the lmms-eval_videochat codes we provide for evaluation.

================================================ FILE: llava-train_videochat/cog.yaml ================================================ # Configuration for Cog ⚙️ # Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md build: gpu: true python_version: "3.11" python_packages: - "torch==2.0.1" - "accelerate==0.21.0" - "bitsandbytes==0.41.0" - "deepspeed==0.9.5" - "einops-exts==0.0.4" - "einops==0.6.1" - "gradio==3.35.2" - "gradio_client==0.2.9" - "httpx==0.24.0" - "markdown2==2.4.10" - "numpy==1.26.0" - "peft==0.4.0" - "scikit-learn==1.2.2" - "sentencepiece==0.1.99" - "shortuuid==1.0.11" - "timm==0.6.13" - "tokenizers==0.13.3" - "torch==2.0.1" - "torchvision==0.15.2" - "transformers==4.31.0" - "wandb==0.15.12" - "wavedrom==2.0.3.post3" - "Pygments==2.16.1" run: - curl -o /usr/local/bin/pget -L "https://github.com/replicate/pget/releases/download/v0.0.3/pget" && chmod +x /usr/local/bin/pget # predict.py defines how predictions are run on your model predict: "predict.py:Predictor" ================================================ FILE: llava-train_videochat/data/ablation_short-long_mix_sft.yaml ================================================ datasets: # image sft datasets - json_path: annotations/image/textcaps.json # 21942 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textcaps - json_path: annotations/image/textocr(gpt4v).json # 25104 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textocr(gpt4v) - json_path: annotations/image/rendered_text(cauldron)_fix.json # 9995 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/rendered_text(cauldron) - json_path: annotations/image/iam(cauldron)_fix.json # 5658 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/iam(cauldron) - json_path: annotations/image/llavar_gpt4_20k.json # 19790 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/llavar_gpt4_20k - json_path: annotations/image/allava_instruct_vflan4v.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_vflan4v - json_path: annotations/image/allava_instruct_laion4v.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_laion4v - json_path: annotations/image/sharegpt4o.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4o - json_path: annotations/image/sharegpt4v(coco).json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(coco) - json_path: annotations/image/sharegpt4v(knowledge).json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(knowledge) - json_path: annotations/image/sharegpt4v(llava).json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(llava) - json_path:
annotations/image/sharegpt4v(sam).json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(sam) - json_path: annotations/image/tallyqa(cauldron,llava_format)_fix.json # 98675 sampling_strategy: "first:10%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/tallyqa(cauldron,llava_format) # 98680 - json_path: annotations/image/st_vqa(cauldron,llava_format)_fix.json # 17242 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/st_vqa(cauldron,llava_format) # 17247 - json_path: annotations/image/llava_next_raw_format_processed_738k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data - json_path: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data/m4_instruct_annotations.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data # video sft datasets - json_path: annotations/video/caption_sharegemini_webvid_core100k_clean.json sampling_strategy: "first:20%" data_root: https://github.com/m-bain/webvid - json_path: annotations/video/caption_sharegemini_k400_223k.json sampling_strategy: "first:25%" data_root: https://opendatalab.com/OpenMMLab/Kinetics-400 - json_path: annotations/video/caption_youcook2-youcook2-train_debug_9k.json sampling_strategy: "first:25%" data_root: http://youcook2.eecs.umich.edu/ - json_path: annotations/video/caption_textvr-textvr-train_40k.json sampling_strategy: "first:25%" data_root: https://github.com/callsys/TextVR - json_path: annotations/video/moviechat1k_caption-MovieChat-train_caption_1k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train - json_path: annotations/video/caption_favd-favd-train_10k.json sampling_strategy: "first:25%" data_root: https://github.com/OpenNLPLab/FAVDBench - json_path: annotations/video/caption_sharegptvideo_300k-sharegptvideo-train_300k_302k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k video_read_type: img - json_path: annotations/video/caption_sharegpt4o-sharegpt4o_3k.json sampling_strategy: "first:25%" data_root: https://sharegpt4o.github.io/ - json_path: annotations/video/vqa_tvqa-tvqa_123k.jsonl sampling_strategy: "first:25%" data_root: https://nlp.cs.unc.edu/data/jielei/tvqa/tvqa_public_html/index.html video_read_type: img - json_path: annotations/video/reasoning_next_qa-next_qa-train_35k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/doc-doc/NExT-QA - json_path: annotations/video/vqa_tgif_transition_qa-tgif_transition_qa-train_53k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/reasoning_clevrer_mc-clevrer_mc-train_43k_debug_43k.jsonl sampling_strategy: "first:25%" data_root: http://clevrer.csail.mit.edu/ - json_path: annotations/video/reasoning_clevrer_qa-clevrer_qa-train_mc_40k.jsonl sampling_strategy: "first:25%" data_root: http://clevrer.csail.mit.edu/ - json_path: annotations/video/classification_k710-k710-train_40k.jsonl sampling_strategy: "first:25%" - json_path: annotations/video/classification_ssv2-ssv2-train_40k.jsonl sampling_strategy: "first:25%" data_root: https://www.qualcomm.com/developer/software/something-something-v-2-dataset - json_path: annotations/video/lsmdc-lsmdc_297k.json sampling_strategy: "first:25%" 
data_root: https://sites.google.com/site/describingmovies/ - json_path: annotations/video/vqa_rgbd-nturgbd_clean_110k.json sampling_strategy: "first:25%" data_root: https://rose1.ntu.edu.sg/dataset/actionRecognition/ - json_path: annotations/video/vqa_perception_train-mc_question_train_forchoice_8k.json sampling_strategy: "first:25%" data_root: https://github.com/google-deepmind/perception_test - json_path: annotations/video/vqa_ego_qa-ego_qa-train_8k.jsonl sampling_strategy: "first:25%" data_root: https://ego4d-data.org/ - json_path: annotations/video/vqa_tgif_transition_qa_openend-openend_qa_annos-tgif_transition_qa_train_openend_53k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/vqa_tgif_frame_qa-tgif_frame_qa-train_40k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/vqa_tgif_count-openend_qa_train_openend_26839.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/vqa_tgif_action-openend_qa_train_openend_20471.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/reasoning_next_qa_oe-openend_qa_annos-next_qa_train_openend_35k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/doc-doc/NExT-QA - json_path: annotations/video/vqa_webvid_qa-webvid_qa-train_100k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/m-bain/webvid - json_path: annotations/video/moviechat1k_global-MovieChat-train_global_1k.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train - json_path: annotations/video/grounding_didemo-didemo-train_66k.json sampling_strategy: "first:25%" data_root: https://github.com/LisaAnne/TemporalLanguageRelease - json_path: annotations/video/vqa_sharegptvideo_240k-sharegptvideo-train_240k_240k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k video_read_type: img - json_path: annotations/video/caption_vidln_kinetics-vidln-kinetics_train_28k.json sampling_strategy: "first:25%" data_root: https://opendatalab.com/OpenMMLab/Kinetics_700 - json_path: annotations/video/caption_vidln_oops-vidln-oops_train_11k.json sampling_strategy: "first:25%" data_root: https://oops.cs.columbia.edu/ - json_path: annotations/video/caption_vidln_ovis-vidln-ovis_train_1k.json sampling_strategy: "first:25%" data_root: https://songbai.site/ovis/ video_read_type: img - json_path: annotations/video/caption_vidln_uvo_sparse-vidln-uvo_sparse_train_6k.json sampling_strategy: "first:25%" data_root: https://sites.google.com/view/unidentified-video-object/dataset - json_path: annotations/video/caption_vidln_uvo_dense-vidln-uvo_dense_train_1k.json sampling_strategy: "first:25%" data_root: https://sites.google.com/view/unidentified-video-object/dataset - json_path: annotations/video/reasoning_star-star-train_46k.json sampling_strategy: "first:25%" data_root: https://bobbywu.com/STAR/ - json_path: annotations/video/vcg-plus_112K_clean_97k.json sampling_strategy: "first:10%" data_root: http://activity-net.org/ - json_path: annotations/video/vript_long_videos_en_20240911_fix.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/Mutonix/Vript - json_path: 
annotations/video/vript_short_videos_en_20240911_fix.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/Mutonix/Vript - json_path: annotations/video/guiworld_en_20241029_fix.jsonl sampling_strategy: "first:25%" data_root: https://gui-world.github.io/ ## llava video - json_path: annotations/video/llava-video_2_3_m_academic_mc_v0_1_qa_processed_6901_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_nextqa_oe_qa_processed_61_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_youtube_oe_v0_1_qa_processed_420200_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_academic_oe_v0_1_qa_processed_26302_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_youtube_mc_v0_1_qa_processed_39710_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_nextqa_oe_qa_processed_6843_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_youtube_mc_v0_1_qa_processed_39967_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_academic_v0_1_cap_processed_3124_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_academic_oe_v0_1_qa_processed_57924_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_youtube_v0_1_cap_processed_24685_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_youtube_mc_v0_1_qa_processed_39927_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_activitynetqa_oe_qa_processed_2950_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_nextqa_oe_qa_processed_4694_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_youtube_oe_v0_1_qa_processed_110624_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_academic_mc_v0_1_qa_processed_4241_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_youtube_mc_v0_1_qa_processed_39353_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - 
json_path: annotations/video/llava-video_30_60_s_activitynetqa_oe_qa_processed_4530_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_youtube_oe_v0_1_qa_processed_137645_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_academic_mc_v0_1_qa_processed_20346_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_youtube_v0_1_cap_processed_19995_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_nextqa_mc_qa_processed_5496_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_academic_mc_v0_1_qa_processed_5753_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_youtube_oe_v0_1_qa_processed_141495_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_nextqa_mc_qa_processed_4633_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_activitynetqa_oe_qa_processed_7460_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_nextqa_mc_qa_processed_52_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_activitynetqa_oe_qa_processed_8590_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_academic_v0_1_cap_processed_4627_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_academic_v0_1_cap_processed_10514_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_youtube_v0_1_cap_processed_24234_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_nextqa_mc_qa_processed_6843_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_nextqa_oe_qa_processed_5492_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_academic_oe_v0_1_qa_processed_48468_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path:
annotations/video/llava-video_0_30_s_youtube_v0_1_cap_processed_79346_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_academic_oe_v0_1_qa_processed_18134_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_perceptiontest_mc_qa_processed_1785_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_perceptiontest_mc_qa_processed_618_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_academic_v0_1_cap_processed_11985_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/timeit_ANet-TimeIT-Activitynet_Captions_11k.json sampling_strategy: "first:25%" data_root: http://activity-net.org//train - json_path: annotations/video/timeit_COIN-TimeIT-COIN_10k.json sampling_strategy: "first:25%" data_root: https://coin-dataset.github.io/ - json_path: annotations/video/timeit_DiDeMo-TimeIT-DiDeMo_33k.json sampling_strategy: "first:25%" data_root: https://github.com/LisaAnne/TemporalLanguageRelease - json_path: annotations/video/timeit_HiREST-TimeIT-HiREST_1k.json sampling_strategy: "first:25%" data_root: https://hirest-cvpr2023.github.io/ - json_path: annotations/video/timeit_QuerYD-TimeIT-QuerYD_15k.json sampling_strategy: "first:25%" data_root: https://www.robots.ox.ac.uk/~vgg/data/queryd/ - json_path: annotations/video/timeit_ViTT-TimeIT-ViTT_6k.json sampling_strategy: "first:25%" data_root: https://github.com/google-research-datasets/Video-Timeline-Tags-ViTT - json_path: annotations/video/grounding_ANetRTL-ActivityNet-RTL-ANet_RTL_34k.json sampling_strategy: "first:25%" data_root: http://activity-net.org//train - json_path: annotations/video/grounding_ANetHL-ANet-HL-ANet_HL2_11k.json sampling_strategy: "first:25%" data_root: http://activity-net.org//train - json_path: annotations/video/htstep_eventunderstanding-longvideo_annos-htstep_eventunderstanding_1k_1k.json sampling_strategy: "first:25%" video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset - json_path: annotations/video/htstep_eventcount-longvideo_annos-htstep_eventcount_2k_2k.json sampling_strategy: "first:25%" video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset - json_path: annotations/video/htstep_eventrelationship-longvideo_annos-htstep_eventrelationship_1k_1k.json sampling_strategy: "first:25%" video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset - json_path: annotations/video/ego4dhcap_eventunderstanding-longvideo_annos-ego4dhcap_eventunderstanding_2k_2k.json sampling_strategy: "first:25%" video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
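Each entry in these stage YAMLs pairs a json_path (the annotation file) with the data_root its raw media comes from, plus an optional video_read_type (img for pre-extracted frame folders, gif for GIF inputs; the default is ordinary video decoding) and a sampling_strategy that subsamples the annotation list. As an illustration of the strategy strings used above ("all", "first:10%", "first:20%", "first:25%"), here is a minimal sketch, assuming annotations load as plain Python lists; apply_sampling_strategy is hypothetical, not a function from this repository:

def apply_sampling_strategy(samples, strategy):
    """Subsample an annotation list according to a strategy string."""
    if strategy == "all":  # keep every sample
        return samples
    mode, _, amount = strategy.partition(":")
    if amount.endswith("%"):  # e.g. "first:25%" -> leading quarter of the list
        count = round(len(samples) * float(amount[:-1]) / 100)
    else:  # an absolute count such as "first:1000" would also parse
        count = int(amount)
    if mode == "first":
        return samples[:count]
    raise ValueError(f"unsupported sampling strategy: {strategy}")

print(len(apply_sampling_strategy(list(range(40000)), "first:25%")))  # 10000

================================================ FILE: llava-train_videochat/data/stage1_init_connector_iv1m.yaml ================================================ datasets: - json_path: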
OpenGVLab/VideoChat-Flash-Training-Data/annotations/video/smit_caption_481k.json sampling_strategy: all data_root: http://moments.csail.mit.edu/spoken.html - json_path: OpenGVLab/VideoChat-Flash-Training-Data/annotations/image/blip_laion_cc_sbu_558k.json sampling_strategy: all data_root: https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain ================================================ FILE: llava-train_videochat/data/stage2_short_pretrain_iv6m.yaml ================================================ datasets: - json_path: annotations/image/LLaVA-ReCap-118K.json sampling_strategy: all data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-118K - json_path: annotations/image/LLaVA-ReCap-CC3M.json sampling_strategy: all data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-CC3M - json_path: annotations/image/LLaVA-ReCap-558K.json sampling_strategy: all data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-558K - json_path: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/evol_instruct/evol_instruct_processed.json sampling_strategy: all - json_path: annotations/video/webvid-fuse_caption_2m.json sampling_strategy: all data_root: https://github.com/m-bain/webvid - json_path: annotations/video/caption_sharegemini_webvid_core100k_clean.json sampling_strategy: all data_root: https://github.com/m-bain/webvid - json_path: annotations/video/caption_sharegemini_k400_223k.json sampling_strategy: all data_root: https://opendatalab.com/OpenMMLab/Kinetics-400 - json_path: annotations/image/ureader_tr_processed.json data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/ureader_ur/ sampling_strategy: all - json_path: annotations/image/synthdog_zh_processed.json data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/synthdog_zh/synthdog_zh_images/ sampling_strategy: all - json_path: annotations/image/synthdog_en_processed.json data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/synthdog_en/synthdog_en_images/ sampling_strategy: all - json_path: annotations/video/smit_caption_481k.json sampling_strategy: all data_root: http://moments.csail.mit.edu/spoken.html - json_path: annotations/video/caption_sharegptvideo_300k-sharegptvideo-train_300k_302k.json sampling_strategy: all data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k video_read_type: img ================================================ FILE: llava-train_videochat/data/stage3_short-long_mix_sft.yaml ================================================ datasets: # image sft datasets - json_path: annotations/image/textcaps.json # 21942 sampling_strategy: all data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textcaps - json_path: annotations/image/textocr(gpt4v).json # 25104 sampling_strategy: all data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textocr(gpt4v) - json_path: annotations/image/rendered_text(cauldron)_fix.json # 9995 sampling_strategy: all data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/rendered_text(cauldron) - json_path: annotations/image/iam(cauldron)_fix.json # 5658 sampling_strategy: all data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/iam(cauldron) - json_path: annotations/image/llavar_gpt4_20k.json # 19790 sampling_strategy: "all" data_root: 
https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/llavar_gpt4_20k - json_path: annotations/image/allava_instruct_vflan4v.json sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_vflan4v - json_path: annotations/image/allava_instruct_laion4v.json sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_laion4v - json_path: annotations/image/sharegpt4o.json sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4o - json_path: annotations/image/sharegpt4v(coco).json sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(coco) - json_path: annotations/image/sharegpt4v(knowledge).json sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(knowledge) - json_path: annotations/image/sharegpt4v(llava).json sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(llava) - json_path: annotations/image/sharegpt4v(sam).json sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(sam) - json_path: annotations/image/tallyqa(cauldron,llava_format)_fix.json # 98675 sampling_strategy: "first:10%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/tallyqa(cauldron,llava_format) # 98680 - json_path: annotations/image/st_vqa(cauldron,llava_format)_fix.json # 17242 sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/st_vqa(cauldron,llava_format) # 17247 - json_path: annotations/image/llava_next_raw_format_processed_738k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data - json_path: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data/m4_instruct_annotations.json sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data # video sft datasets - json_path: annotations/video/caption_sharegemini_webvid_core100k_clean.json sampling_strategy: "first:20%" data_root: https://github.com/m-bain/webvid - json_path: annotations/video/caption_sharegemini_k400_223k.json sampling_strategy: "all" data_root: https://opendatalab.com/OpenMMLab/Kinetics-400 - json_path: annotations/video/caption_youcook2-youcook2-train_debug_9k.json sampling_strategy: "all" data_root: http://youcook2.eecs.umich.edu/ - json_path: annotations/video/caption_textvr-textvr-train_40k.json sampling_strategy: "all" data_root: https://github.com/callsys/TextVR - json_path: annotations/video/moviechat1k_caption-MovieChat-train_caption_1k.json sampling_strategy: "all" data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train - json_path: annotations/video/caption_favd-favd-train_10k.json sampling_strategy: "first:25%" data_root: https://github.com/OpenNLPLab/FAVDBench - json_path: annotations/video/caption_sharegptvideo_300k-sharegptvideo-train_300k_302k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k video_read_type: img - json_path: annotations/video/caption_sharegpt4o-sharegpt4o_3k.json sampling_strategy: all data_root: https://sharegpt4o.github.io/ - json_path: annotations/video/vqa_tvqa-tvqa_123k.jsonl sampling_strategy: "all" data_root: 
https://nlp.cs.unc.edu/data/jielei/tvqa/tvqa_public_html/index.html video_read_type: img - json_path: annotations/video/reasoning_next_qa-next_qa-train_35k.jsonl sampling_strategy: all data_root: https://github.com/doc-doc/NExT-QA - json_path: annotations/video/vqa_tgif_transition_qa-tgif_transition_qa-train_53k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/reasoning_clevrer_mc-clevrer_mc-train_43k_debug_43k.jsonl sampling_strategy: all data_root: http://clevrer.csail.mit.edu/ - json_path: annotations/video/reasoning_clevrer_qa-clevrer_qa-train_mc_40k.jsonl sampling_strategy: all data_root: http://clevrer.csail.mit.edu/ - json_path: annotations/video/classification_k710-k710-train_40k.jsonl sampling_strategy: "first:25%" - json_path: annotations/video/classification_ssv2-ssv2-train_40k.jsonl sampling_strategy: "first:25%" data_root: https://www.qualcomm.com/developer/software/something-something-v-2-dataset - json_path: annotations/video/lsmdc-lsmdc_297k.json sampling_strategy: "first:25%" data_root: https://sites.google.com/site/describingmovies/ - json_path: annotations/video/vqa_rgbd-nturgbd_clean_110k.json sampling_strategy: "first:25%" data_root: https://rose1.ntu.edu.sg/dataset/actionRecognition/ - json_path: annotations/video/vqa_perception_train-mc_question_train_forchoice_8k.json sampling_strategy: all data_root: https://github.com/google-deepmind/perception_test - json_path: annotations/video/vqa_ego_qa-ego_qa-train_8k.jsonl sampling_strategy: "all" data_root: https://ego4d-data.org/ - json_path: annotations/video/vqa_tgif_transition_qa_openend-openend_qa_annos-tgif_transition_qa_train_openend_53k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/vqa_tgif_frame_qa-tgif_frame_qa-train_40k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/vqa_tgif_count-openend_qa_train_openend_26839.jsonl sampling_strategy: "all" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/vqa_tgif_action-openend_qa_train_openend_20471.jsonl sampling_strategy: "all" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/reasoning_next_qa_oe-openend_qa_annos-next_qa_train_openend_35k.jsonl sampling_strategy: all data_root: https://github.com/doc-doc/NExT-QA - json_path: annotations/video/vqa_webvid_qa-webvid_qa-train_100k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/m-bain/webvid - json_path: annotations/video/moviechat1k_global-MovieChat-train_global_1k.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train - json_path: annotations/video/grounding_didemo-didemo-train_66k.json sampling_strategy: all data_root: https://github.com/LisaAnne/TemporalLanguageRelease - json_path: annotations/video/vqa_sharegptvideo_240k-sharegptvideo-train_240k_240k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k video_read_type: img - json_path: annotations/video/caption_vidln_kinetics-vidln-kinetics_train_28k.json sampling_strategy: all data_root: https://opendatalab.com/OpenMMLab/Kinetics_700 - json_path: annotations/video/caption_vidln_oops-vidln-oops_train_11k.json sampling_strategy: all data_root: 
https://oops.cs.columbia.edu/ - json_path: annotations/video/caption_vidln_ovis-vidln-ovis_train_1k.json sampling_strategy: all data_root: https://songbai.site/ovis/ video_read_type: img - json_path: annotations/video/caption_vidln_uvo_sparse-vidln-uvo_sparse_train_6k.json sampling_strategy: all data_root: https://sites.google.com/view/unidentified-video-object/dataset - json_path: annotations/video/caption_vidln_uvo_dense-vidln-uvo_dense_train_1k.json sampling_strategy: all data_root: https://sites.google.com/view/unidentified-video-object/dataset - json_path: annotations/video/reasoning_star-star-train_46k.json sampling_strategy: all data_root: https://bobbywu.com/STAR/ - json_path: annotations/video/vcg-plus_112K_clean_97k.json sampling_strategy: "first:10%" data_root: http://activity-net.org/ - json_path: annotations/video/vript_long_videos_en_20240911_fix.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/Mutonix/Vript - json_path: annotations/video/vript_short_videos_en_20240911_fix.jsonl sampling_strategy: all data_root: https://huggingface.co/datasets/Mutonix/Vript - json_path: annotations/video/guiworld_en_20241029_fix.jsonl sampling_strategy: "all" data_root: https://gui-world.github.io/ ## llava video - json_path: annotations/video/llava-video_2_3_m_academic_mc_v0_1_qa_processed_6901_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_nextqa_oe_qa_processed_61_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_youtube_oe_v0_1_qa_processed_420200_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_academic_oe_v0_1_qa_processed_26302_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_youtube_mc_v0_1_qa_processed_39710_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_nextqa_oe_qa_processed_6843_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_youtube_mc_v0_1_qa_processed_39967_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_academic_v0_1_cap_processed_3124_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_academic_oe_v0_1_qa_processed_57924_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_youtube_v0_1_cap_processed_24685_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_youtube_mc_v0_1_qa_processed_39927_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_activitynetqa_oe_qa_processed_2950_with_duration.jsonl sampling_strategy: 
"all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_nextqa_oe_qa_processed_4694_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_youtube_oe_v0_1_qa_processed_110624_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_academic_mc_v0_1_qa_processed_4241_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_youtube_mc_v0_1_qa_processed_39353_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_activitynetqa_oe_qa_processed_4530_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_youtube_oe_v0_1_qa_processed_137645_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_academic_mc_v0_1_qa_processed_20346_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_youtube_v0_1_cap_processed_19995_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_nextqa_mc_qa_processed_5496_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_academic_mc_v0_1_qa_processed_5753_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_youtube_oe_v0_1_qa_processed_141495_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_nextqa_mc_qa_processed_4633_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_activitynetqa_oe_qa_processed_7460_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_nextqa_mc_qa_processed_52_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_activitynetqa_oe_qa_processed_8590_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_academic_v0_1_cap_processed_4627_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_academic_v0_1_cap_processed_10514_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: 
annotations/video/llava-video_1_2_m_youtube_v0_1_cap_processed_24234_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_nextqa_mc_qa_processed_6843_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_nextqa_oe_qa_processed_5492_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_academic_oe_v0_1_qa_processed_48468_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_youtube_v0_1_cap_processed_79346_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_academic_oe_v0_1_qa_processed_18134_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_perceptiontest_mc_qa_processed_1785_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_perceptiontest_mc_qa_processed_618_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_academic_v0_1_cap_processed_11985_with_duration.jsonl sampling_strategy: "all" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/timeit_ANet-TimeIT-Activitynet_Captions_11k.json sampling_strategy: all data_root: http://activity-net.org//train - json_path: annotations/video/timeit_COIN-TimeIT-COIN_10k.json sampling_strategy: all data_root: https://coin-dataset.github.io/ - json_path: annotations/video/timeit_DiDeMo-TimeIT-DiDeMo_33k.json sampling_strategy: all data_root: https://github.com/LisaAnne/TemporalLanguageRelease - json_path: annotations/video/timeit_HiREST-TimeIT-HiREST_1k.json sampling_strategy: all data_root: https://hirest-cvpr2023.github.io/ - json_path: annotations/video/timeit_QuerYD-TimeIT-QuerYD_15k.json sampling_strategy: all data_root: https://www.robots.ox.ac.uk/~vgg/data/queryd/ - json_path: annotations/video/timeit_ViTT-TimeIT-ViTT_6k.json sampling_strategy: all data_root: https://github.com/google-research-datasets/Video-Timeline-Tags-ViTT - json_path: annotations/video/grounding_ANetRTL-ActivityNet-RTL-ANet_RTL_34k.json sampling_strategy: all data_root: http://activity-net.org//train - json_path: annotations/video/grounding_ANetHL-ANet-HL-ANet_HL2_11k.json sampling_strategy: all data_root: http://activity-net.org//train - json_path: annotations/video/htstep_eventunderstanding-longvideo_annos-htstep_eventunderstanding_1k_1k.json sampling_strategy: all video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset - json_path: annotations/video/htstep_eventcount-longvideo_annos-htstep_eventcount_2k_2k.json sampling_strategy: all video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset - json_path: annotations/video/htstep_eventrelationship-longvideo_annos-htstep_eventrelationship_1k_1k.json 
sampling_strategy: all video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset - json_path: annotations/video/ego4dhcap_eventunderstanding-longvideo_annos-ego4dhcap_eventunderstanding_2k_2k.json sampling_strategy: all video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
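Relative to the stage-3 mixture above, which keeps most annotation sets in full ("all"), the stage-4 high-resolution post-SFT config below reuses largely the same sources at reduced ratios (mostly "first:25%"). A small sketch for auditing such a mixture, assuming PyYAML and the multi-line layout these configs have on disk; summarize_mixture is illustrative, not a repository function:

from collections import Counter

import yaml  # PyYAML, assumed available in the training environment

def summarize_mixture(yaml_path):
    """Tally sampling strategies and video read types for one stage config."""
    with open(yaml_path) as f:
        cfg = yaml.safe_load(f)
    datasets = cfg["datasets"]
    strategies = Counter(d.get("sampling_strategy", "all") for d in datasets)
    # Entries without video_read_type fall back to ordinary video decoding.
    read_types = Counter(d.get("video_read_type", "video") for d in datasets)
    return strategies, read_types

# e.g. summarize_mixture("data/stage4_highres_postsft.yaml")

================================================ FILE: llava-train_videochat/data/stage4_highres_postsft.yaml ================================================ datasets: # image sft datasets, 6w - json_path: annotations/image/synthdog_zh_processed.json data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/synthdog_zh/synthdog_zh_images/ sampling_strategy: "first:10%" - json_path: annotations/image/synthdog_en_processed.json data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/synthdog_en/synthdog_en_images/ sampling_strategy: "first:10%" - json_path: annotations/image/textcaps.json # 21942 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textcaps - json_path: annotations/image/textocr(gpt4v).json # 25104 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textocr(gpt4v) - json_path: annotations/image/rendered_text(cauldron)_fix.json # 9995 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/rendered_text(cauldron) - json_path: annotations/image/iam(cauldron)_fix.json # 5658 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/iam(cauldron) - json_path: annotations/image/llavar_gpt4_20k.json # 19790 sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/llavar_gpt4_20k - json_path: annotations/image/allava_instruct_vflan4v.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_vflan4v - json_path: annotations/image/allava_instruct_laion4v.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_laion4v - json_path: annotations/image/sharegpt4o.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4o - json_path: annotations/image/sharegpt4v(coco).json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(coco) - json_path: annotations/image/sharegpt4v(knowledge).json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(knowledge) - json_path: annotations/image/sharegpt4v(llava).json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(llava) - json_path: annotations/image/sharegpt4v(sam).json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(sam) - json_path: annotations/image/tallyqa(cauldron,llava_format)_fix.json # 98675 sampling_strategy: "first:10%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/tallyqa(cauldron,llava_format) # 98680 - json_path: annotations/image/st_vqa(cauldron,llava_format)_fix.json # 17242 sampling_strategy: "first:25%"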
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/st_vqa(cauldron,llava_format) # 17247 - json_path: annotations/image/llava_next_raw_format_processed_738k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data - json_path: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data/m4_instruct_annotations.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data # video sft datasets - json_path: annotations/video/caption_sharegemini_webvid_core100k_clean.json sampling_strategy: "first:20%" data_root: https://github.com/m-bain/webvid - json_path: annotations/video/caption_sharegemini_k400_223k.json sampling_strategy: "first:25%" data_root: https://opendatalab.com/OpenMMLab/Kinetics-400 - json_path: annotations/video/caption_youcook2-youcook2-train_debug_9k.json sampling_strategy: "first:25%" data_root: http://youcook2.eecs.umich.edu/ - json_path: annotations/video/caption_textvr-textvr-train_40k.json sampling_strategy: "first:25%" data_root: https://github.com/callsys/TextVR - json_path: annotations/video/moviechat1k_caption-MovieChat-train_caption_1k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train - json_path: annotations/video/caption_favd-favd-train_10k.json sampling_strategy: "first:25%" data_root: https://github.com/OpenNLPLab/FAVDBench - json_path: annotations/video/caption_sharegptvideo_300k-sharegptvideo-train_300k_302k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k video_read_type: img - json_path: annotations/video/caption_sharegpt4o-sharegpt4o_3k.json sampling_strategy: "first:25%" data_root: https://sharegpt4o.github.io/ - json_path: annotations/video/vqa_tvqa-tvqa_123k.jsonl sampling_strategy: "first:25%" data_root: https://nlp.cs.unc.edu/data/jielei/tvqa/tvqa_public_html/index.html video_read_type: img - json_path: annotations/video/reasoning_next_qa-next_qa-train_35k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/doc-doc/NExT-QA - json_path: annotations/video/vqa_tgif_transition_qa-tgif_transition_qa-train_53k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/reasoning_clevrer_mc-clevrer_mc-train_43k_debug_43k.jsonl sampling_strategy: "first:25%" data_root: http://clevrer.csail.mit.edu/ - json_path: annotations/video/reasoning_clevrer_qa-clevrer_qa-train_mc_40k.jsonl sampling_strategy: "first:25%" data_root: http://clevrer.csail.mit.edu/ - json_path: annotations/video/classification_k710-k710-train_40k.jsonl sampling_strategy: "first:25%" - json_path: annotations/video/classification_ssv2-ssv2-train_40k.jsonl sampling_strategy: "first:25%" data_root: https://www.qualcomm.com/developer/software/something-something-v-2-dataset - json_path: annotations/video/lsmdc-lsmdc_297k.json sampling_strategy: "first:25%" data_root: https://sites.google.com/site/describingmovies/ - json_path: annotations/video/vqa_rgbd-nturgbd_clean_110k.json sampling_strategy: "first:25%" data_root: https://rose1.ntu.edu.sg/dataset/actionRecognition/ - json_path: annotations/video/vqa_perception_train-mc_question_train_forchoice_8k.json sampling_strategy: "first:25%" data_root: https://github.com/google-deepmind/perception_test - json_path: annotations/video/vqa_ego_qa-ego_qa-train_8k.jsonl sampling_strategy: "first:25%" 
data_root: https://ego4d-data.org/ - json_path: annotations/video/vqa_tgif_transition_qa_openend-openend_qa_annos-tgif_transition_qa_train_openend_53k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/vqa_tgif_frame_qa-tgif_frame_qa-train_40k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/vqa_tgif_count-openend_qa_train_openend_26839.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/vqa_tgif_action-openend_qa_train_openend_20471.jsonl sampling_strategy: "first:25%" data_root: https://github.com/YunseokJANG/tgif-qa video_read_type: gif - json_path: annotations/video/reasoning_next_qa_oe-openend_qa_annos-next_qa_train_openend_35k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/doc-doc/NExT-QA - json_path: annotations/video/vqa_webvid_qa-webvid_qa-train_100k.jsonl sampling_strategy: "first:25%" data_root: https://github.com/m-bain/webvid - json_path: annotations/video/moviechat1k_global-MovieChat-train_global_1k.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train - json_path: annotations/video/grounding_didemo-didemo-train_66k.json sampling_strategy: "first:25%" data_root: https://github.com/LisaAnne/TemporalLanguageRelease - json_path: annotations/video/vqa_sharegptvideo_240k-sharegptvideo-train_240k_240k.json sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k video_read_type: img - json_path: annotations/video/caption_vidln_kinetics-vidln-kinetics_train_28k.json sampling_strategy: "first:25%" data_root: https://opendatalab.com/OpenMMLab/Kinetics_700 - json_path: annotations/video/caption_vidln_oops-vidln-oops_train_11k.json sampling_strategy: "first:25%" data_root: https://oops.cs.columbia.edu/ - json_path: annotations/video/caption_vidln_ovis-vidln-ovis_train_1k.json sampling_strategy: "first:25%" data_root: https://songbai.site/ovis/ video_read_type: img - json_path: annotations/video/caption_vidln_uvo_sparse-vidln-uvo_sparse_train_6k.json sampling_strategy: "first:25%" data_root: https://sites.google.com/view/unidentified-video-object/dataset - json_path: annotations/video/caption_vidln_uvo_dense-vidln-uvo_dense_train_1k.json sampling_strategy: "first:25%" data_root: https://sites.google.com/view/unidentified-video-object/dataset - json_path: annotations/video/reasoning_star-star-train_46k.json sampling_strategy: "first:25%" data_root: https://bobbywu.com/STAR/ - json_path: annotations/video/vcg-plus_112K_clean_97k.json sampling_strategy: "first:10%" data_root: http://activity-net.org/ - json_path: annotations/video/vript_long_videos_en_20240911_fix.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/Mutonix/Vript - json_path: annotations/video/vript_short_videos_en_20240911_fix.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/Mutonix/Vript - json_path: annotations/video/guiworld_en_20241029_fix.jsonl sampling_strategy: "first:25%" data_root: https://gui-world.github.io/ ## llava video - json_path: annotations/video/llava-video_2_3_m_academic_mc_v0_1_qa_processed_6901_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: 
annotations/video/llava-video_2_3_m_nextqa_oe_qa_processed_61_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_youtube_oe_v0_1_qa_processed_420200_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_academic_oe_v0_1_qa_processed_26302_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_youtube_mc_v0_1_qa_processed_39710_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_nextqa_oe_qa_processed_6843_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_youtube_mc_v0_1_qa_processed_39967_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_academic_v0_1_cap_processed_3124_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_academic_oe_v0_1_qa_processed_57924_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_youtube_v0_1_cap_processed_24685_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_youtube_mc_v0_1_qa_processed_39927_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_activitynetqa_oe_qa_processed_2950_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_nextqa_oe_qa_processed_4694_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_youtube_oe_v0_1_qa_processed_110624_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_academic_mc_v0_1_qa_processed_4241_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_youtube_mc_v0_1_qa_processed_39353_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_activitynetqa_oe_qa_processed_4530_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_youtube_oe_v0_1_qa_processed_137645_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_academic_mc_v0_1_qa_processed_20346_with_duration.jsonl 
sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_youtube_v0_1_cap_processed_19995_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_nextqa_mc_qa_processed_5496_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_academic_mc_v0_1_qa_processed_5753_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_youtube_oe_v0_1_qa_processed_141495_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_nextqa_mc_qa_processed_4633_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_activitynetqa_oe_qa_processed_7460_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_nextqa_mc_qa_processed_52_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_activitynetqa_oe_qa_processed_8590_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_academic_v0_1_cap_processed_4627_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_academic_v0_1_cap_processed_10514_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_1_2_m_youtube_v0_1_cap_processed_24234_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_nextqa_mc_qa_processed_6843_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_nextqa_oe_qa_processed_5492_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_academic_oe_v0_1_qa_processed_48468_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_youtube_v0_1_cap_processed_79346_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_2_3_m_academic_oe_v0_1_qa_processed_18134_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_perceptiontest_mc_qa_processed_1785_with_duration.jsonl sampling_strategy: "first:25%" data_root: 
https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_30_60_s_perceptiontest_mc_qa_processed_618_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/llava-video_0_30_s_academic_v0_1_cap_processed_11985_with_duration.jsonl sampling_strategy: "first:25%" data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K - json_path: annotations/video/timeit_ANet-TimeIT-Activitynet_Captions_11k.json sampling_strategy: "first:25%" data_root: http://activity-net.org//train - json_path: annotations/video/timeit_COIN-TimeIT-COIN_10k.json sampling_strategy: "first:25%" data_root: https://coin-dataset.github.io/ - json_path: annotations/video/timeit_DiDeMo-TimeIT-DiDeMo_33k.json sampling_strategy: "first:25%" data_root: https://github.com/LisaAnne/TemporalLanguageRelease - json_path: annotations/video/timeit_HiREST-TimeIT-HiREST_1k.json sampling_strategy: "first:25%" data_root: https://hirest-cvpr2023.github.io/ - json_path: annotations/video/timeit_QuerYD-TimeIT-QuerYD_15k.json sampling_strategy: "first:25%" data_root: https://www.robots.ox.ac.uk/~vgg/data/queryd/ - json_path: annotations/video/timeit_ViTT-TimeIT-ViTT_6k.json sampling_strategy: "first:25%" data_root: https://github.com/google-research-datasets/Video-Timeline-Tags-ViTT - json_path: annotations/video/grounding_ANetRTL-ActivityNet-RTL-ANet_RTL_34k.json sampling_strategy: "first:25%" data_root: http://activity-net.org//train - json_path: annotations/video/grounding_ANetHL-ANet-HL-ANet_HL2_11k.json sampling_strategy: "first:25%" data_root: http://activity-net.org//train - json_path: annotations/video/htstep_eventunderstanding-longvideo_annos-htstep_eventunderstanding_1k_1k.json sampling_strategy: "first:25%" video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset - json_path: annotations/video/htstep_eventcount-longvideo_annos-htstep_eventcount_2k_2k.json sampling_strategy: "first:25%" video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset - json_path: annotations/video/htstep_eventrelationship-longvideo_annos-htstep_eventrelationship_1k_1k.json sampling_strategy: "first:25%" video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset - json_path: annotations/video/ego4dhcap_eventunderstanding-longvideo_annos-ego4dhcap_eventunderstanding_2k_2k.json sampling_strategy: "first:25%" video_read_type: img data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset ================================================ FILE: llava-train_videochat/llava/__init__.py ================================================ from .model import LlavaQwenForCausalLM from .train.train import LazySupervisedDataset, DataCollatorForSupervisedDataset ================================================ FILE: llava-train_videochat/llava/constants.py ================================================ CONTROLLER_HEART_BEAT_EXPIRATION = 30 WORKER_HEART_BEAT_INTERVAL = 15 LOGDIR = "." 
# Model Constants IGNORE_INDEX = -100 IMAGE_TOKEN_INDEX = -200 DEFAULT_IMAGE_TOKEN = "<image>" DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>" DEFAULT_IM_START_TOKEN = "<im_start>" DEFAULT_IM_END_TOKEN = "<im_end>" ================================================ FILE: llava-train_videochat/llava/conversation.py ================================================ import dataclasses from enum import auto, Enum from typing import List, Any, Dict, Union, Tuple import re import base64 from io import BytesIO from PIL import Image from transformers import AutoTokenizer class SeparatorStyle(Enum): """Different separator style.""" SINGLE = auto() TWO = auto() MPT = auto() PLAIN = auto() CHATML = auto() LLAMA_2 = auto() LLAMA_3 = auto() QWEN = auto() GEMMA = auto() @dataclasses.dataclass class Conversation: """A class that keeps all conversation history.""" system: str roles: List[str] messages: List[List[str]] offset: int sep_style: SeparatorStyle = SeparatorStyle.SINGLE sep: str = "###" sep2: str = None version: str = "Unknown" tokenizer_id: str = "" tokenizer: Any = None # Stop criteria (the default one is EOS token) stop_str: Union[str, List[str]] = None # Stops generation if meeting any token in this list stop_token_ids: List[int] = None skip_next: bool = False def get_prompt(self): messages = self.messages if len(messages) > 0 and type(messages[0][1]) is tuple: messages = self.messages.copy() init_role, init_msg = messages[0].copy() init_msg = init_msg[0] if "mmtag" in self.version: init_msg = init_msg.replace("<image>", "").strip() messages[0] = (init_role, init_msg) messages.insert(0, (self.roles[0], "<Image><image></Image>")) messages.insert(1, (self.roles[1], "Received.")) elif not init_msg.startswith("<image>"): init_msg = init_msg.replace("<image>", "").strip() messages[0] = (init_role, "<image>\n" + init_msg) else: messages[0] = (init_role, init_msg) if self.sep_style == SeparatorStyle.SINGLE: ret = self.system + self.sep for role, message in messages: if message: if type(message) is tuple: message, _, _ = message ret += role + ": " + message + self.sep else: ret += role + ":" elif self.sep_style == SeparatorStyle.TWO: seps = [self.sep, self.sep2] ret = self.system + seps[0] for i, (role, message) in enumerate(messages): if message: if type(message) is tuple: message, _, _ = message ret += role + ": " + message + seps[i % 2] else: ret += role + ":" elif self.sep_style == SeparatorStyle.CHATML: ret = "" if self.system == "" else self.system + self.sep + "\n" for role, message in messages: if message: if type(message) is tuple: message, images, _ = message message = "<image>" * len(images) + message ret += role + "\n" + message + self.sep + "\n" else: ret += role + "\n" return ret elif self.sep_style == SeparatorStyle.LLAMA_3: chat_template_messages = [{"role": "system", "content": self.system}] for role, message in messages: if message: if type(message) is tuple: message, images = message message = "<image>" * len(images) + message chat_template_messages.append({"role": role, "content": message}) # print(chat_template_messages) return self.tokenizer.apply_chat_template(chat_template_messages, tokenize=False, add_generation_prompt=True) # ret = "" if self.system == "" else self.system + self.sep + "\n" # for role, message in messages: # if message: # if type(message) is tuple: # message, images = message # message = "<image>" * len(images) + message # ret += role + "\n" + message + self.sep + "\n" # else: # ret += role + "\n" # return ret elif self.sep_style == SeparatorStyle.MPT: ret = self.system + self.sep for role, message in messages: if message: if type(message) is tuple: message, _, _ = message
ret += role + message + self.sep else: ret += role elif self.sep_style == SeparatorStyle.GEMMA: ret = "" for i, (role, message) in enumerate(messages): assert role == self.roles[i % 2], "Conversation should alternate user/assistant/user/assistant/..." if message: if type(message) is tuple: message, _, _ = message ret += role + message + self.sep else: ret += role elif self.sep_style == SeparatorStyle.LLAMA_2: wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n" if len(msg) > 0 else msg wrap_inst = lambda msg: f"[INST] {msg} [/INST]" ret = "" for i, (role, message) in enumerate(messages): if i == 0: assert message, "first message should not be none" assert role == self.roles[0], "first message should come from user" if message: if type(message) is tuple: message, _, _ = message if i == 0: message = wrap_sys(self.system) + message if i % 2 == 0: message = wrap_inst(message) ret += self.sep + message else: ret += " " + message + " " + self.sep2 else: ret += "" ret = ret.lstrip(self.sep) elif self.sep_style == SeparatorStyle.PLAIN: seps = [self.sep, self.sep2] ret = self.system for i, (role, message) in enumerate(messages): if message: if type(message) is tuple: message, _, _ = message ret += message + seps[i % 2] else: ret += "" else: raise ValueError(f"Invalid style: {self.sep_style}") return ret def append_message(self, role, message): self.messages.append([role, message]) def process_image(self, image, image_process_mode, return_pil=False, image_format="PNG"): if image_process_mode == "Pad": def expand2square(pil_img, background_color=(122, 116, 104)): width, height = pil_img.size if width == height: return pil_img elif width > height: result = Image.new(pil_img.mode, (width, width), background_color) result.paste(pil_img, (0, (width - height) // 2)) return result else: result = Image.new(pil_img.mode, (height, height), background_color) result.paste(pil_img, ((height - width) // 2, 0)) return result image = expand2square(image) elif image_process_mode in ["Default", "Crop"]: pass elif image_process_mode == "Resize": image = image.resize((336, 336)) else: raise ValueError(f"Invalid image_process_mode: {image_process_mode}") if type(image) is not Image.Image: image = Image.open(image).convert("RGB") max_hw, min_hw = max(image.size), min(image.size) aspect_ratio = max_hw / min_hw max_len, min_len = 672, 448 shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw)) longest_edge = int(shortest_edge * aspect_ratio) W, H = image.size if H > W: H, W = longest_edge, shortest_edge else: H, W = shortest_edge, longest_edge image = image.resize((W, H)) if return_pil: return image else: buffered = BytesIO() image.save(buffered, format=image_format) img_b64_str = base64.b64encode(buffered.getvalue()).decode() return img_b64_str def get_images(self, return_pil=False, return_path=False): images = [] for i, (role, msg) in enumerate(self.messages[self.offset :]): if i % 2 == 0: if type(msg) is tuple: msg, image, image_process_mode = msg if type(image) != list: image = [image] for img in image: if not return_path and self.is_image_file(img): img = self.process_image(img, image_process_mode, return_pil=return_pil) else: images.append(img) return images def is_image_file(self, filename): image_extensions = [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"] return any(filename.lower().endswith(ext) for ext in image_extensions) def is_video_file(self, filename): video_extensions = [".mp4", ".mov", ".avi", ".mkv", ".wmv", ".flv", ".mpeg", ".mpg"] return any(filename.lower().endswith(ext) for ext in
video_extensions) def to_gradio_chatbot(self): ret = [] for i, (role, msg) in enumerate(self.messages[self.offset :]): if i % 2 == 0: if type(msg) is tuple: msg, image, image_process_mode = msg if type(image) != list: image = [image] if len(image) == 1: msg = "<image>\n" + msg.replace("<image>", "").strip() else: msg = re.sub(r"(<image>)\n(?=<image>)", r"\1 ", msg) img_str_list = [] for img in image: if self.is_image_file(img): img_b64_str = self.process_image(img, "Default", return_pil=False, image_format="JPEG") img_str = f'<img src="data:image/jpeg;base64,{img_b64_str}" alt="user upload image" />' img_str_list.append(img_str) elif self.is_video_file(img): ret.append(((img,), None)) msg = msg.strip() img_place_holder = "" for img_str in img_str_list: img_place_holder += f"{img_str}\n\n" if len(img_str_list) > 0: msg = f"{img_place_holder}\n\n{msg}" if len(msg) > 0: ret.append([msg, None]) else: ret.append([msg, None]) else: ret[-1][-1] = msg return ret def copy(self): return Conversation(system=self.system, roles=self.roles, messages=[[x, y] for x, y in self.messages], offset=self.offset, sep_style=self.sep_style, sep=self.sep, sep2=self.sep2, version=self.version) def dict(self): if len(self.get_images()) > 0: return { "system": self.system, "roles": self.roles, "messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages], "offset": self.offset, "sep": self.sep, "sep2": self.sep2, } return { "system": self.system, "roles": self.roles, "messages": self.messages, "offset": self.offset, "sep": self.sep, "sep2": self.sep2, } conv_vicuna_v0 = Conversation( system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.", roles=("Human", "Assistant"), messages=[ ["Human", "What are the key differences between renewable and non-renewable energy sources?"], [ "Assistant", "Renewable energy sources are those that can be replenished naturally in a relatively " "short amount of time, such as solar, wind, hydro, geothermal, and biomass. " "Non-renewable energy sources, on the other hand, are finite and will eventually be " "depleted, such as coal, oil, and natural gas. Here are some key differences between " "renewable and non-renewable energy sources:\n" "1. Availability: Renewable energy sources are virtually inexhaustible, while non-renewable " "energy sources are finite and will eventually run out.\n" "2. Environmental impact: Renewable energy sources have a much lower environmental impact " "than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, " "and other negative effects.\n" "3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically " "have lower operational costs than non-renewable sources.\n" "4. Reliability: Renewable energy sources are often more reliable and can be used in more remote " "locations than non-renewable sources.\n" "5. Flexibility: Renewable energy sources are often more flexible and can be adapted to different " "situations and needs, while non-renewable sources are more rigid and inflexible.\n" "6. Sustainability: Renewable energy sources are more sustainable over the long term, while " "non-renewable sources are not, and their depletion can lead to economic and social instability.\n", ], ], offset=2, sep_style=SeparatorStyle.SINGLE, sep="###", ) conv_vicuna_v1 = Conversation( system="A chat between a curious user and an artificial intelligence assistant. 
" "The assistant gives helpful, detailed, and polite answers to the user's questions.", roles=("USER", "ASSISTANT"), version="v1", messages=[], offset=0, sep_style=SeparatorStyle.TWO, sep=" ", sep2="", ) conv_llama_2 = Conversation( system="""You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.""", roles=("USER", "ASSISTANT"), version="llama_v2", messages=[], offset=0, sep_style=SeparatorStyle.LLAMA_2, sep="", sep2="", ) conv_llava_llama_2 = Conversation( system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.", roles=("USER", "ASSISTANT"), version="llama_v2", messages=[], offset=0, sep_style=SeparatorStyle.LLAMA_2, sep="", sep2="", ) # conv_llava_llama_3 = Conversation( # system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.", # roles=("user", "assistant"), # version="llama_v3", # messages=[], # offset=0, # sep="<|eot_id|>", # sep_style=SeparatorStyle.LLAMA_3, # tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct", # tokenizer=AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"), # stop_token_ids=[128009], # ) conv_mistral_instruct = Conversation( system="", roles=("USER", "ASSISTANT"), version="llama_v2", messages=[], offset=0, sep_style=SeparatorStyle.LLAMA_2, sep="", sep2="", ) conv_llava_llama_2_simple = Conversation( system="Answer the questions about the visual content that the user provides.", roles=("USER", "ASSISTANT"), version="llama_v2", messages=[], offset=0, sep_style=SeparatorStyle.LLAMA_2, sep="", sep2="", ) conv_llava_llama_2_mmtag = Conversation( system="Answer the questions about the visual content that the user provides." "The visual content will be provided with the following format: visual content.", roles=("USER", "ASSISTANT"), version="llama_v2_mmtag", messages=[], offset=0, sep_style=SeparatorStyle.LLAMA_2, sep="", sep2="", ) conv_mpt = Conversation( system="""<|im_start|>system A conversation between a user and an LLM-based AI assistant. 
The assistant gives helpful and honest answers.""", roles=("<|im_start|>user\n", "<|im_start|>assistant\n"), version="mpt", messages=[], offset=0, sep_style=SeparatorStyle.MPT, sep="<|im_end|>", ) conv_qwen = Conversation( system="""<|im_start|>system You are a helpful assistant.""", roles=("<|im_start|>user", "<|im_start|>assistant"), version="qwen", messages=[], offset=0, sep_style=SeparatorStyle.CHATML, sep="<|im_end|>", ) conv_internlm_2 = Conversation( system="""<|im_start|>system You are a helpful assistant.""", roles=("<|im_start|>user", "<|im_start|>assistant"), version="internlm_2", messages=[], offset=0, sep_style=SeparatorStyle.CHATML, sep="<|im_end|>", ) conv_gemma_instruct = Conversation(system="", roles=("<start_of_turn>user\n", "<start_of_turn>model\n"), version="gemma", messages=[], offset=0, sep_style=SeparatorStyle.GEMMA, sep="<end_of_turn>\n") conv_llava_plain = Conversation( system="", roles=("", ""), messages=[], offset=0, sep_style=SeparatorStyle.PLAIN, sep="\n", ) conv_llava_v0 = Conversation( system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.", roles=("Human", "Assistant"), messages=[], offset=0, sep_style=SeparatorStyle.SINGLE, sep="###", ) conv_llava_v0_mmtag = Conversation( system="A chat between a curious user and an artificial intelligence assistant. " "The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language." "The visual content will be provided with the following format: <Image>visual content</Image>.", roles=("Human", "Assistant"), messages=[], offset=0, sep_style=SeparatorStyle.SINGLE, sep="###", version="v0_mmtag", ) conv_llava_v1 = Conversation( system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.", roles=("USER", "ASSISTANT"), version="v1", messages=[], offset=0, sep_style=SeparatorStyle.TWO, sep=" ", sep2="</s>", ) conv_llava_v1_mmtag = Conversation( system="A chat between a curious user and an artificial intelligence assistant. " "The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language." "The visual content will be provided with the following format: <Image>visual content</Image>.", roles=("USER", "ASSISTANT"), messages=[], offset=0, sep_style=SeparatorStyle.TWO, sep=" ", sep2="</s>", version="v1_mmtag", ) conv_mistral_orca = Conversation( system="""<|im_start|>system You are MistralOrca, a large language model trained by Alignment Lab AI. 
Write out your reasoning step-by-step to be sure you get the right answers!""", roles=("<|im_start|>user\n", "<|im_start|>assistant\n"), version="mpt", messages=[], offset=0, sep_style=SeparatorStyle.MPT, sep="<|im_end|>", ) conv_mistral_zephyr = Conversation( system="""<|system|> You are a helpful AI assistant.""", roles=("<|user|>\n", "<|assistant|>\n"), version="mpt", messages=[], offset=0, sep_style=SeparatorStyle.MPT, sep="</s>", ) conv_mistral_direct = Conversation( system="""<|im_start|>system Answer the questions.""", roles=("<|im_start|>user\n", "<|im_start|>assistant\n"), version="mpt", messages=[], offset=0, sep_style=SeparatorStyle.MPT, sep="<|im_end|>", ) conv_chatml_direct = Conversation( system="""<|im_start|>system Answer the questions.""", roles=("<|im_start|>user\n", "<|im_start|>assistant\n"), version="mpt", messages=[], offset=0, sep_style=SeparatorStyle.MPT, sep="<|im_end|>", ) default_conversation = conv_vicuna_v0 conv_templates = { "default": conv_vicuna_v0, "v0": conv_vicuna_v0, "v1": conv_vicuna_v1, "vicuna_v1": conv_vicuna_v1, "llama_2": conv_llama_2, "mistral_instruct": conv_mistral_instruct, "mistral_orca": conv_mistral_orca, "mistral_zephyr": conv_mistral_zephyr, "mistral_direct": conv_mistral_direct, "plain": conv_llava_plain, "v0_plain": conv_llava_plain, "chatml_direct": conv_chatml_direct, "llava_v0": conv_llava_v0, "llava_v0_mmtag": conv_llava_v0_mmtag, "llava_v1": conv_llava_v1, "llava_v1_mmtag": conv_llava_v1_mmtag, "llava_llama_2": conv_llava_llama_2, # "llava_llama_3": conv_llava_llama_3, "llava_llama_2_simple": conv_llava_llama_2_simple, "llava_llama_2_mmtag": conv_llava_llama_2_mmtag, "llava_mistral_instruct": conv_mistral_instruct, "mpt": conv_mpt, "qwen_1_5": conv_qwen, "qwen_2": conv_qwen, "internlm_2": conv_internlm_2, "gemma_instruct": conv_gemma_instruct, } if __name__ == "__main__": print(default_conversation.get_prompt()) print(default_conversation) ================================================ FILE: llava-train_videochat/llava/dist_utils.py ================================================ import json import os import builtins import datetime import time import subprocess import torch import torch.distributed as dist def get_rank() -> int: if not dist.is_available(): return 0 if not dist.is_initialized(): return 0 return dist.get_rank() def get_world_size() -> int: if not dist.is_available(): return 1 if not dist.is_initialized(): return 1 return dist.get_world_size() def setup_for_distributed(is_master): builtin_print = builtins.print def print(*args, **kwargs): force = kwargs.pop("force", False) # force = force or (get_world_size() > 8) if is_master or force: now = datetime.datetime.now().time() builtin_print("[{}] ".format(now), end="") # print with time stamp builtin_print(*args, **kwargs) builtins.print = print def init_distributed_mode(use_dynamic_port: bool = True): if "SLURM_PROCID" in os.environ: rank = int(os.environ["SLURM_PROCID"]) local_rank = rank % torch.cuda.device_count() world_size = int(os.environ["SLURM_NTASKS"]) try: local_size = int(os.environ["SLURM_NTASKS_PER_NODE"]) except: local_size = int(os.environ.get("LOCAL_SIZE", 1)) if "MASTER_PORT" not in os.environ: port = 10023 # + random.randint(0, 20) # if use_dynamic_port: # for i in range(10042, 65535): # cmd = f"netstat -aon|grep {i}" # with os.popen(cmd, "r") as file: # if file.read() == "": # port = i # break print(f"MASTER_PORT = {port}") os.environ["MASTER_PORT"] = str(port) time.sleep(3) node_list = os.environ["SLURM_STEP_NODELIST"] addr = 
subprocess.getoutput(f"scontrol show hostname {node_list} | head -n1") if "MASTER_ADDR" not in os.environ: os.environ["MASTER_ADDR"] = addr os.environ["RANK"] = str(rank) os.environ["LOCAL_RANK"] = str(local_rank) os.environ["LOCAL_WORLD_SIZE"] = str(local_size) os.environ["WORLD_SIZE"] = str(world_size) else: rank = int(os.environ["RANK"]) setup_for_distributed(rank == 0) print( f"Rank {os.environ['RANK']} | Local Rank {os.environ['LOCAL_RANK']} | " f"World Size {os.environ['WORLD_SIZE']} | Local World Size {os.environ['LOCAL_WORLD_SIZE']} |", force=True ) ================================================ FILE: llava-train_videochat/llava/mm_utils.py ================================================ from PIL import Image from io import BytesIO import base64 import math import ast import re import torch from transformers import StoppingCriteria from llava.constants import IMAGE_TOKEN_INDEX def resize_and_center_crop(image, shortest_edge_length): # Calculate new dimensions and resize aspect_ratio = float(image.width) / float(image.height) if aspect_ratio > 1: new_width = int(shortest_edge_length * aspect_ratio) new_height = shortest_edge_length else: new_width = shortest_edge_length new_height = int(shortest_edge_length / aspect_ratio) resized_image = image.resize((new_width, new_height), Image.ANTIALIAS) # Calculate the position and perform the center crop left = (new_width - shortest_edge_length) / 2 top = (new_height - shortest_edge_length) / 2 right = (new_width + shortest_edge_length) / 2 bottom = (new_height + shortest_edge_length) / 2 cropped_image = resized_image.crop((left, top, right, bottom)) return cropped_image def auto_pad_images(image, grid_params): assert isinstance(image, Image.Image), "Input should be a Pillow Image" assert len(grid_params) > 0, "Grid parameters should not be empty" # Step 1: Calculate and find the closest aspect ratio input_width, input_height = image.size input_aspect_ratio = input_width / input_height candidate_resolutions = [(w / h, w, h) for w in grid_params for h in grid_params] closest_aspect_ratio = min(candidate_resolutions, key=lambda x: abs(input_aspect_ratio - x[0])) candidate_resolutions = [(x[1], x[2]) for x in candidate_resolutions if abs(x[0] - closest_aspect_ratio[0]) < 1e-3] target_resolution = min(candidate_resolutions, key=lambda res: abs(max(input_width, input_height) / max(res) - 1)) resize_width, resize_height = target_resolution if input_width > input_height: resize_height = int(resize_width / input_aspect_ratio) else: resize_width = int(resize_height * input_aspect_ratio) resized_image = image.resize((resize_width, resize_height), Image.ANTIALIAS) # Step 5: Pad the resized image if necessary to match the target resolution pad_width = target_resolution[0] - resize_width pad_height = target_resolution[1] - resize_height padded_image = Image.new("RGB", target_resolution, color=(0, 0, 0)) padded_image.paste(resized_image, (pad_width // 2, pad_height // 2)) return padded_image def extract_patches(image, patch_size, overlap_ratio): assert isinstance(image, Image.Image), "Input should be a Pillow Image" assert patch_size > 0, "Patch size should be greater than 0" assert 0 <= overlap_ratio < 1, "Overlap ratio should be between 0 and 1" W, H = image.size patches = [] stride = int(patch_size * (1 - overlap_ratio)) num_patches_y = (H - patch_size) // stride + 1 num_patches_x = (W - patch_size) // stride + 1 y_start = (H - (num_patches_y - 1) * stride - patch_size) // 2 x_start = (W - (num_patches_x - 1) * stride - patch_size) // 2 for y in 
range(y_start, y_start + num_patches_y * stride, stride): for x in range(x_start, x_start + num_patches_x * stride, stride): patch = image.crop((x, y, x + patch_size, y + patch_size)) patches.append(patch) return patches def process_highres_image_crop_split(image, data_args, processor=None): crop_resolution = data_args.image_crop_resolution split_resolution = data_args.image_split_resolution if processor is None: processor = data_args.image_processor image_crop = resize_and_center_crop(image, crop_resolution) image_patches = extract_patches(image_crop, patch_size=split_resolution, overlap_ratio=0) image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches] return torch.stack(image_patches, dim=0) def process_highres_image(image, processor, grid_pinpoints): grid_params = [int(x) for x in grid_pinpoints.split(",")] width_height = max(image.size) fit_grid_params = [x for x in grid_params if x >= width_height] if len(fit_grid_params) == 0: select_size = max(grid_params) else: select_size = min(fit_grid_params) # FIXME: always select the 448 select_size = max(grid_params) image_padded = expand2square(image, tuple(int(x * 255) for x in processor.image_mean)) # FIXME: this seems to be a bug that it always resizes instead of padding image_original_resize = image.resize((processor.size["shortest_edge"], processor.size["shortest_edge"])) image_padded = image_padded.resize((select_size, select_size)) image_patches = extract_patches(image_padded, patch_size=processor.size["shortest_edge"], overlap_ratio=0) image_patches = [image_original_resize] + image_patches image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches] return torch.stack(image_patches, dim=0) def select_best_resolution(original_size, possible_resolutions, max_resolutions, patch_size): """ Selects the best resolution from a list of possible resolutions based on the original size. Args: original_size (tuple): The original size of the image in the format (width, height). possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...]. Returns: tuple: The best fit resolution in the format (width, height). 
""" original_width, original_height = original_size best_fit = None max_effective_resolution = 0 min_wasted_resolution = float("inf") for width, height in possible_resolutions: if max_resolutions != None and (width * height != patch_size * patch_size): if (width * height+patch_size*patch_size) > max_resolutions: # NOTE 要算一个global continue # Calculate the downscaled size to keep the aspect ratio scale = min(width / original_width, height / original_height) downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale) # Calculate effective and wasted resolutions effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height) wasted_resolution = (width * height) - effective_resolution if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution): max_effective_resolution = effective_resolution min_wasted_resolution = wasted_resolution best_fit = (width, height) # print(f"original_size={original_size}, possible_resolutions={possible_resolutions}, max_resolutions={max_resolutions}, best_fit={best_fit}") assert best_fit is not None, f"Can't find suitable fit in {possible_resolutions} at max:{max_resolutions}" return best_fit def resize_and_pad_image(image, target_resolution): """ Resize and pad an image to a target resolution while maintaining aspect ratio. Args: image (PIL.Image.Image): The input image. target_resolution (tuple): The target resolution (width, height) of the image. Returns: PIL.Image.Image: The resized and padded image. """ original_width, original_height = image.size target_width, target_height = target_resolution # Determine which dimension (width or height) to fill scale_w = target_width / original_width scale_h = target_height / original_height if scale_w < scale_h: # Width will be filled completely new_width = target_width new_height = min(math.ceil(original_height * scale_w), target_height) else: # Height will be filled completely new_height = target_height new_width = min(math.ceil(original_width * scale_h), target_width) # Resize the image resized_image = image.resize((new_width, new_height)) # Create a new image with the target size and paste the resized image onto it new_image = Image.new("RGB", (target_width, target_height), (0, 0, 0)) paste_x = (target_width - new_width) // 2 paste_y = (target_height - new_height) // 2 new_image.paste(resized_image, (paste_x, paste_y)) return new_image def divide_to_patches(image, patch_size): """ Divides an image into patches of a specified size. Args: image (PIL.Image.Image): The input image. patch_size (int): The size of each patch. Returns: list: A list of PIL.Image.Image objects representing the patches. """ patches = [] width, height = image.size for i in range(0, height, patch_size): for j in range(0, width, patch_size): box = (j, i, j + patch_size, i + patch_size) patch = image.crop(box) patches.append(patch) return patches def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size, max_resolutions=None): """ Calculate the shape of the image patch grid after the preprocessing for images of any resolution. Args: image_size (tuple): The size of the input image in the format (width, height). grid_pinpoints (str): A string representation of a list of possible resolutions. patch_size (int): The size of each image patch. Returns: tuple: The shape of the image patch grid in the format (width, height). 
""" if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints: assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]" # Use regex to extract the range from the input string matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints) range_start = tuple(map(int, matches[0])) range_end = tuple(map(int, matches[-1])) # Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1]) grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)] # Multiply all elements by patch_size grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints] if type(grid_pinpoints) is list: possible_resolutions = grid_pinpoints else: possible_resolutions = ast.literal_eval(grid_pinpoints) width, height = select_best_resolution(image_size, possible_resolutions, max_resolutions=max_resolutions, patch_size=patch_size) # print("get width/patch size", width, patch_size, flush=True) return width // patch_size, height // patch_size def process_anyres_image(image, processor, grid_pinpoints): """ Process an image with variable resolutions. Args: image (PIL.Image.Image): The input image to be processed. processor: The image processor object. grid_pinpoints (str): A string representation of a list of possible resolutions. Returns: torch.Tensor: A tensor containing the processed image patches. """ raise NotImplementedError # Convert grid_pinpoints from string to list if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints: try: patch_size = processor.size[0] except Exception as e: patch_size = processor.size["shortest_edge"] assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]" # Use regex to extract the range from the input string matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints) range_start = tuple(map(int, matches[0])) range_end = tuple(map(int, matches[-1])) # Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1]) grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)] # Multiply all elements by patch_size grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints] if type(grid_pinpoints) is list: possible_resolutions = grid_pinpoints else: possible_resolutions = ast.literal_eval(grid_pinpoints) best_resolution = select_best_resolution(image.size, possible_resolutions) image_padded = resize_and_pad_image(image, best_resolution) patches = divide_to_patches(image_padded, processor.crop_size["height"]) # FIXME: this seems to be a bug that it resizes instead of pad. 
# but to keep it consistent with previous, i will keep it as it is # TODO: uncomment below to ablate with the padding if isinstance(processor.size, dict): shortest_edge = processor.size["shortest_edge"] else: shortest_edge = min(processor.size) image_original_resize = image.resize((shortest_edge, shortest_edge)) # image_padded_square = expand2square(image, tuple(int(x*255) for x in processor.image_mean)) # image_original_resize = image_padded_square.resize((processor.size['shortest_edge'], processor.size['shortest_edge'])) image_patches = [image_original_resize] + patches image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches] # print("image.size", image.size, "len(image_patches):", len(image_patches), "patch_size:", image_patches[0].shape) return torch.stack(image_patches, dim=0) def process_anyres_image_nopad(image, processor, grid_pinpoints): """ Process an image with variable resolutions. Args: image (PIL.Image.Image): The input image to be processed. processor: The image processor object. grid_pinpoints (str): A string representation of a list of possible resolutions. Returns: torch.Tensor: A tensor containing the processed image patches. """ # Convert grid_pinpoints from string to list try: patch_size = processor.size[0] except Exception as e: patch_size = processor.size["shortest_edge"] assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]" if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints: # Use regex to extract the range from the input string matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints) range_start = tuple(map(int, matches[0])) range_end = tuple(map(int, matches[-1])) # Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1]) grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)] # Multiply all elements by patch_size grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints] if type(grid_pinpoints) is list: possible_resolutions = grid_pinpoints else: possible_resolutions = ast.literal_eval(grid_pinpoints) best_resolution = select_best_resolution(image.size, possible_resolutions, max_resolutions=None, patch_size=patch_size) # currently no resolution limit for images # image_padded = resize_and_pad_image(image, best_resolution) patches = divide_to_patches(image.resize(best_resolution), patch_size) # FIXME: this seems to be a bug that it resizes instead of pad. 
# but to keep it consistent with previous, i will keep it as it is # TODO: uncomment below to ablate with the padding if isinstance(processor.size, dict): shortest_edge = processor.size["shortest_edge"] else: shortest_edge = min(processor.size) image_original_resize = image.resize((shortest_edge, shortest_edge)) # image_padded_square = expand2square(image, tuple(int(x*255) for x in processor.image_mean)) # image_original_resize = image_padded_square.resize((processor.size['shortest_edge'], processor.size['shortest_edge'])) image_patches = [image_original_resize] + patches image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches] # raise ValueError(f"image.size: {image.size} len(image_patches): {len(image_patches)}, patch_size:, {image_patches[0].shape}, possible_resolutions:, {possible_resolutions}, best: {best_resolution}") return torch.stack(image_patches, dim=0) def process_anyres_video_nopad(video, processor, grid_pinpoints, max_resolutions): """ Process an image with variable resolutions. Args: video (numpy.ndarray): (T, H, W, C) image (PIL.Image.Image): The input image to be processed. processor: The image processor object. grid_pinpoints (str): A string representation of a list of possible resolutions. Returns: torch.Tensor: A tensor containing the processed image patches. """ # Convert grid_pinpoints from string to list try: patch_size = processor.size[0] except Exception as e: patch_size = processor.size["shortest_edge"] assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]" if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints: # Use regex to extract the range from the input string matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints) range_start = tuple(map(int, matches[0])) range_end = tuple(map(int, matches[-1])) # Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1]) grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)] # Multiply all elements by patch_size grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints] if type(grid_pinpoints) is list: possible_resolutions = grid_pinpoints else: possible_resolutions = ast.literal_eval(grid_pinpoints) best_resolution = select_best_resolution(video[0].shape[0:2], possible_resolutions, max_resolutions=max_resolutions, patch_size=patch_size) video = processor.preprocess(video, return_tensors="pt", target_size=best_resolution)["pixel_values"] print("data: new_video.shape:", video.shape, "best_resolution:", best_resolution) return video def load_image_from_base64(image): return Image.open(BytesIO(base64.b64decode(image))) def expand2square(pil_img, background_color): width, height = pil_img.size if width == height: return pil_img elif width > height: result = Image.new(pil_img.mode, (width, width), background_color) result.paste(pil_img, (0, (width - height) // 2)) return result else: result = Image.new(pil_img.mode, (height, height), background_color) result.paste(pil_img, ((height - width) // 2, 0)) return result def process_images(images, image_processor, model_cfg): image_aspect_ratio = getattr(model_cfg, "image_aspect_ratio", None) new_images = [] if image_aspect_ratio == "highres": raise NotImplementedError for image in images: image = process_highres_image(image, image_processor, model_cfg.image_grid_pinpoints) new_images.append(image) elif "anyres" in image_aspect_ratio: for 
image in images: if "nopad" in image_aspect_ratio: image = process_anyres_image_nopad(image, image_processor, model_cfg.image_grid_pinpoints) else: image = process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints) new_images.append(image) elif image_aspect_ratio == "crop_split": raise NotImplementedError for image in images: image = process_highres_image_crop_split(image, model_cfg, image_processor) new_images.append(image) elif image_aspect_ratio == "pad": for image in images: image = expand2square(image, tuple(int(x * 255) for x in image_processor.image_mean)) image = image_processor.preprocess(image, return_tensors="pt")["pixel_values"][0] new_images.append(image) else: return image_processor.preprocess(images, return_tensors="pt")["pixel_values"] if all(x.shape == new_images[0].shape for x in new_images): new_images = torch.stack(new_images, dim=0) return new_images def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None): prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")] def insert_separator(X, sep): return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1] input_ids = [] offset = 0 if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id: offset = 1 input_ids.append(prompt_chunks[0][0]) for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)): input_ids.extend(x[offset:]) if return_tensors is not None: if return_tensors == "pt": return torch.tensor(input_ids, dtype=torch.long) raise ValueError(f"Unsupported tensor type: {return_tensors}") return input_ids def get_model_name_from_path(model_path): model_path = model_path.strip("/") model_paths = model_path.split("/") if model_paths[-1].startswith("checkpoint-"): return model_paths[-2] + "_" + model_paths[-1] else: return model_paths[-1] class KeywordsStoppingCriteria(StoppingCriteria): def __init__(self, keywords, tokenizer, input_ids): self.keywords = keywords self.keyword_ids = [] for keyword in keywords: cur_keyword_ids = tokenizer(keyword).input_ids if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id: cur_keyword_ids = cur_keyword_ids[1:] self.keyword_ids.append(torch.tensor(cur_keyword_ids)) self.tokenizer = tokenizer self.start_len = input_ids.shape[1] def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool: assert output_ids.shape[0] == 1, "Only support batch size 1 (yet)" # TODO offset = min(output_ids.shape[1] - self.start_len, 3) self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids] for keyword_id in self.keyword_ids: if (output_ids[0, -keyword_id.shape[0] :] == keyword_id).all(): return True outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0] for keyword in self.keywords: if keyword in outputs: return True return False ================================================ FILE: llava-train_videochat/llava/model/__init__.py ================================================ import os AVAILABLE_MODELS = { "llava_qwen": "LlavaQwenForCausalLM, LlavaQwenConfig", "llava_qwen_flash": "LlavaQwenForCausalLM_Flash, LlavaQwenConfig_Flash" } for model_name, model_classes in AVAILABLE_MODELS.items(): try: exec(f"from .language_model.{model_name} import {model_classes}") except Exception as e: print(f"Failed to import {model_name} from llava.language_model.{model_name}. 
Error: {e}") ================================================ FILE: llava-train_videochat/llava/model/apply_delta.py ================================================ """ Usage: python3 -m fastchat.model.apply_delta --base ~/model_weights/llama-7b --target ~/model_weights/vicuna-7b --delta lmsys/vicuna-7b-delta """ import argparse import torch from tqdm import tqdm from transformers import AutoTokenizer, AutoModelForCausalLM from llava import LlavaLlamaForCausalLM def apply_delta(base_model_path, target_model_path, delta_path): print("Loading base model") base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) print("Loading delta") delta = LlavaLlamaForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) delta_tokenizer = AutoTokenizer.from_pretrained(delta_path) print("Applying delta") for name, param in tqdm(delta.state_dict().items(), desc="Applying delta"): if name not in base.state_dict(): assert name in ["model.mm_projector.weight", "model.mm_projector.bias"], f"{name} not in base model" continue if param.data.shape == base.state_dict()[name].shape: param.data += base.state_dict()[name] else: assert name in ["model.embed_tokens.weight", "lm_head.weight"], f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}" bparam = base.state_dict()[name] param.data[: bparam.shape[0], : bparam.shape[1]] += bparam print("Saving target model") delta.save_pretrained(target_model_path) delta_tokenizer.save_pretrained(target_model_path) if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("--base-model-path", type=str, required=True) parser.add_argument("--target-model-path", type=str, required=True) parser.add_argument("--delta-path", type=str, required=True) args = parser.parse_args() apply_delta(args.base_model_path, args.target_model_path, args.delta_path) ================================================ FILE: llava-train_videochat/llava/model/builder.py ================================================ # Copyright 2023 Haotian Liu # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
import os import warnings import shutil from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig import torch from llava.model import * from llava.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN from llava.utils import rank0_print def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", attn_implementation="flash_attention_2", customized_config=None, overwrite_config=None, **kwargs): kwargs["device_map"] = device_map if load_8bit: kwargs["load_in_8bit"] = True elif load_4bit: kwargs["load_in_4bit"] = True kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4") else: kwargs["torch_dtype"] = torch.float16 if customized_config is not None: kwargs["config"] = customized_config if "multimodal" in kwargs: if kwargs["multimodal"] is True: is_multimodal = True kwargs.pop("multimodal") else: is_multimodal = False else: is_multimodal = False assert is_multimodal, "I need it!!!" if "llava" in model_name.lower() or is_multimodal: # Load LLaVA model if "lora" in model_name.lower() and model_base is None: raise NotImplementedError("I don't like lora.") warnings.warn( "There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument. Detailed instruction: https://github.com/haotian-liu/LLaVA#launch-a-model-worker-lora-weights-unmerged." ) if "lora" in model_name.lower() and model_base is not None: raise NotImplementedError("I don't like lora.") lora_cfg_pretrained = AutoConfig.from_pretrained(model_path) tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) rank0_print("Loading LLaVA from base model...") if "mixtral" in model_name.lower(): from llava.model.language_model.llava_mixtral import LlavaMixtralConfig lora_cfg_pretrained = LlavaMixtralConfig.from_pretrained(model_path) tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) model = LlavaMixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs) elif "mistral" in model_name.lower(): from llava.model.language_model.llava_mistral import LlavaMistralConfig lora_cfg_pretrained = LlavaMistralConfig.from_pretrained(model_path) tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) model = LlavaMistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs) elif "gemma" in model_name.lower(): from llava.model.language_model.llava_gemma import LlavaGemmaConfig lora_cfg_pretrained = LlavaGemmaConfig.from_pretrained(model_path) tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) model = LlavaGemmaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs) else: from llava.model.language_model.llava_llama import LlavaConfig lora_cfg_pretrained = LlavaConfig.from_pretrained(model_path) tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs) token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features if model.lm_head.weight.shape[0] != 
token_num: model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype)) model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype)) rank0_print("Loading additional LLaVA weights...") if os.path.exists(os.path.join(model_path, "non_lora_trainables.bin")): non_lora_trainables = torch.load(os.path.join(model_path, "non_lora_trainables.bin"), map_location="cpu") else: # this is probably from HF Hub from huggingface_hub import hf_hub_download def load_from_hf(repo_id, filename, subfolder=None): cache_file = hf_hub_download(repo_id=repo_id, filename=filename, subfolder=subfolder) return torch.load(cache_file, map_location="cpu") non_lora_trainables = load_from_hf(model_path, "non_lora_trainables.bin") non_lora_trainables = {(k[11:] if k.startswith("base_model.") else k): v for k, v in non_lora_trainables.items()} if any(k.startswith("model.model.") for k in non_lora_trainables): non_lora_trainables = {(k[6:] if k.startswith("model.") else k): v for k, v in non_lora_trainables.items()} model.load_state_dict(non_lora_trainables, strict=False) from peft import PeftModel rank0_print("Loading LoRA weights...") model = PeftModel.from_pretrained(model, model_path) rank0_print("Merging LoRA weights...") model = model.merge_and_unload() rank0_print("Model is loaded...") elif model_base is not None: # this may be mm projector only, loading projector with preset language model rank0_print(f"Loading LLaVA from base model {model_base}...") if "mixtral" in model_name.lower(): tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) cfg_pretrained = AutoConfig.from_pretrained(model_path) model = LlavaMixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs) elif "mistral" in model_name.lower() or "zephyr" in model_name.lower(): tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) cfg_pretrained = AutoConfig.from_pretrained(model_path) model = LlavaMistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs) elif "gemma" in model_name.lower(): tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) cfg_pretrained = AutoConfig.from_pretrained(model_path) model = LlavaGemmaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs) elif ( "wizardlm-2" in model_name.lower() and "vicuna" in model_name.lower() or "llama" in model_name.lower() or "yi" in model_name.lower() or "nous-hermes" in model_name.lower() or "llava-v1.6-34b" in model_name.lower() or "llava-v1.5" in model_name.lower() ): from llava.model.language_model.llava_llama import LlavaConfig tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) if customized_config is None: llava_cfg = LlavaConfig.from_pretrained(model_path) if "v1.5" in model_name.lower(): llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models else: llava_cfg = customized_config tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) llava_cfg = LlavaConfig.from_pretrained(model_path) model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=llava_cfg, **kwargs) else: raise ValueError(f"Model {model_name} not supported") mm_projector_weights = torch.load(os.path.join(model_path, 
"mm_projector.bin"), map_location="cpu") mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()} model.load_state_dict(mm_projector_weights, strict=False) else: rank0_print(f"Loaded LLaVA model: {model_path}") if "mixtral" in model_name.lower(): raise NotImplementedError("I don't like it.") from llava.model.language_model.llava_mixtral import LlavaMixtralConfig tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) if customized_config is None: llava_cfg = LlavaMixtralConfig.from_pretrained(model_path) else: llava_cfg = customized_config if overwrite_config is not None: rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(llava_cfg, k, v) tokenizer = AutoTokenizer.from_pretrained(model_path) model = LlavaMixtralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs) elif "mistral" in model_name.lower() or "zephyr" in model_name.lower(): tokenizer = AutoTokenizer.from_pretrained(model_path) model = LlavaMistralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs) elif ( "wizardlm-2" in model_name.lower() and "vicuna" in model_name.lower() or "llama" in model_name.lower() # or "yi" in model_name.lower() # 太容易撞车了 or "nous-hermes" in model_name.lower() or "llava-v1.6-34b" in model_name.lower() or "llava-v1.5" in model_name.lower() ): raise NotImplementedError("I don't like it") from llava.model.language_model.llava_llama import LlavaConfig tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) if customized_config is None: llava_cfg = LlavaConfig.from_pretrained(model_path) if "v1.5" in model_name.lower(): llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models else: llava_cfg = customized_config if overwrite_config is not None: rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(llava_cfg, k, v) model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs) elif "qwen" in model_name.lower() or "quyen" in model_name.lower(): tokenizer = AutoTokenizer.from_pretrained(model_path) if "moe" in model_name.lower() or "A14B" in model_name.lower(): from llava.model.language_model.llava_qwen_moe import LlavaQwenMoeConfig if overwrite_config is not None: llava_cfg = LlavaQwenMoeConfig.from_pretrained(model_path) rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(llava_cfg, k, v) model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs) else: model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs) elif "flash" in model_name.lower(): from llava.model.language_model.llava_qwen_flash import LlavaQwenConfig_Flash if overwrite_config is not None: llava_cfg = LlavaQwenConfig_Flash.from_pretrained(model_path) rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(llava_cfg, k, v) model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs) else: model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, 
attn_implementation=attn_implementation, **kwargs) else: from llava.model.language_model.llava_qwen import LlavaQwenConfig if overwrite_config is not None: llava_cfg = LlavaQwenConfig.from_pretrained(model_path) rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(llava_cfg, k, v) model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs) else: model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs) elif "internlm2" in model_name.lower(): tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) from llava.model.language_model.llava_internlm2 import LlavaInternLM2Config if overwrite_config is not None: llava_cfg = LlavaInternLM2Config.from_pretrained(model_path, trust_remote_code=True) rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(llava_cfg, k, v) model = LlavaInternLM2ForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, trust_remote_code=True, **kwargs) else: model = LlavaInternLM2ForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, attn_implementation=attn_implementation, **kwargs) elif "gemma" in model_name.lower(): raise NotImplementedError("I don't like it") tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) cfg_pretrained = AutoConfig.from_pretrained(model_path) model = LlavaGemmaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs) else: # default to Qwen try: tokenizer = AutoTokenizer.from_pretrained(model_path) if "moe" in model_name.lower() or "A14B" in model_name.lower(): from llava.model.language_model.llava_qwen_moe import LlavaQwenMoeConfig if overwrite_config is not None: llava_cfg = LlavaQwenMoeConfig.from_pretrained(model_path) rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(llava_cfg, k, v) model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs) else: model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs) elif "flash" in model_name.lower(): from llava.model.language_model.llava_qwen_flash import LlavaQwenConfig_Flash if overwrite_config is not None: llava_cfg = LlavaQwenConfig_Flash.from_pretrained(model_path) rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(llava_cfg, k, v) model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs) else: model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs) elif "fastv" in model_name.lower(): from llava.model.language_model.llava_qwen_fastv import LlavaQwenConfig_FastV if overwrite_config is not None: llava_cfg = LlavaQwenConfig_FastV.from_pretrained(model_path) rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(llava_cfg, k, v) model = LlavaQwenForCausalLM_FastV.from_pretrained(model_path, low_cpu_mem_usage=True, 
attn_implementation=attn_implementation, config=llava_cfg, **kwargs) else: model = LlavaQwenForCausalLM_FastV.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs) else: from llava.model.language_model.llava_qwen import LlavaQwenConfig if overwrite_config is not None: llava_cfg = LlavaQwenConfig.from_pretrained(model_path) rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(llava_cfg, k, v) model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs) else: model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs) except: raise ValueError(f"Model {model_name} not supported") # try: # from llava.model.language_model.llava_llama import LlavaConfig # tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) # if customized_config is None: # llava_cfg = LlavaConfig.from_pretrained(model_path) # if "v1.5" in model_path.lower(): # llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models # else: # llava_cfg = customized_config # if overwrite_config is not None: # rank0_print(f"Overwriting config with {overwrite_config}") # for k, v in overwrite_config.items(): # setattr(llava_cfg, k, v) # model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs) # except: # raise ValueError(f"Model {model_name} not supported") else: NotImplementedError("I don't want language model only.") # Load language model if model_base is not None: # PEFT model from peft import PeftModel tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) model = AutoModelForCausalLM.from_pretrained(model_base, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto") print(f"Loading LoRA weights from {model_path}") model = PeftModel.from_pretrained(model, model_path) print(f"Merging weights") model = model.merge_and_unload() print("Convert to FP16...") model.to(torch.float16) else: use_fast = False if "mpt" in model_name.lower().replace("prompt", ""): tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True) model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs) else: tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs) rank0_print(f"Model Class: {model.__class__.__name__}") image_processor = None if "llava" in model_name.lower() or is_multimodal: mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False) mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True) if mm_use_im_patch_token: tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True) if mm_use_im_start_end: tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True) model.resize_token_embeddings(len(tokenizer)) vision_tower = model.get_vision_tower() if not vision_tower.is_loaded: vision_tower.load_model(device_map=device_map) if device_map != "auto": vision_tower.to(device="cuda", dtype=torch.float16) image_processor = vision_tower.image_processor if hasattr(model.config, "max_sequence_length"): context_len = model.config.max_sequence_length elif hasattr(model.config, "max_position_embeddings"): context_len = 
model.config.max_position_embeddings elif hasattr(model.config, "tokenizer_model_max_length"): context_len = model.config.tokenizer_model_max_length else: context_len = 2048 return tokenizer, model, image_processor, context_len ================================================ FILE: llava-train_videochat/llava/model/consolidate.py ================================================ """ Usage: python3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate """ import argparse import torch from transformers import AutoTokenizer, AutoModelForCausalLM from llava.model import * from llava.model.utils import auto_upgrade def consolidate_ckpt(src_path, dst_path): print("Loading model") auto_upgrade(src_path) src_model = AutoModelForCausalLM.from_pretrained(src_path, torch_dtype=torch.float16, low_cpu_mem_usage=True) src_tokenizer = AutoTokenizer.from_pretrained(src_path, use_fast=False) src_model.save_pretrained(dst_path) src_tokenizer.save_pretrained(dst_path) if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("--src", type=str, required=True) parser.add_argument("--dst", type=str, required=True) args = parser.parse_args() consolidate_ckpt(args.src, args.dst) ================================================ FILE: llava-train_videochat/llava/model/language_model/llava_qwen.py ================================================ # Copyright 2024 Hao Zhang # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
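# --- Editorial usage sketch (not part of the original repository) ---
# The builder tail above resolves the concrete Llava* class from `model_name` and
# applies `overwrite_config` on top of the checkpoint's config before loading.
# A minimal sketch of a call; the `load_pretrained_model` entry-point name follows
# the usual LLaVA convention, and the path/override key below are hypothetical:
def _demo_build():
    from llava.model.builder import load_pretrained_model  # assumed entry point

    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path="/path/to/VideoChat-Flash_checkpoint",  # hypothetical checkpoint path
        model_base=None,
        model_name="videochat_flash_qwen",  # "flash" routes to LlavaQwenForCausalLM_Flash
        overwrite_config={"mm_llm_compress": False},  # hypothetical override key
    )
    return tokenizer, model, image_processor, context_len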
from typing import List, Optional, Tuple, Union, Dict import torch import torch.nn as nn from torch.nn import CrossEntropyLoss import transformers from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaModel, LlamaForCausalLM from transformers.modeling_outputs import CausalLMOutputWithPast from transformers.generation.utils import GenerateOutput # from ...constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM from transformers import Qwen2Config, Qwen2Model, Qwen2ForCausalLM # from .qwen.modeling_qwen import QWenLMHeadModel, QWenModel # from .qwen.configuration_qwen import QWenConfig class LlavaQwenConfig(Qwen2Config): model_type = "llava_qwen" class LlavaQwenModel(LlavaMetaModel, Qwen2Model): config_class = LlavaQwenConfig def __init__(self, config: Qwen2Config): super(LlavaQwenModel, self).__init__(config) class LlavaQwenForCausalLM(Qwen2ForCausalLM, LlavaMetaForCausalLM): config_class = LlavaQwenConfig def __init__(self, config): # super(Qwen2ForCausalLM, self).__init__(config) Qwen2ForCausalLM.__init__(self, config) config.model_type = "llava_qwen" config.rope_scaling = None self.model = LlavaQwenModel(config) self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) # Initialize weights and apply final processing self.post_init() def get_model(self): return self.model def forward( self, input_ids: torch.LongTensor = None, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_values: Optional[List[torch.FloatTensor]] = None, inputs_embeds: Optional[torch.FloatTensor] = None, labels: Optional[torch.LongTensor] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, images: Optional[torch.FloatTensor] = None, image_sizes: Optional[List[List[int]]] = None, return_dict: Optional[bool] = None, modalities: Optional[List[str]] = ["image"], dpo_forward: Optional[bool] = False, cache_position=None, ) -> Union[Tuple, CausalLMOutputWithPast]: # print("images[0].shape:", images[0].shape) if inputs_embeds is None: (input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes) # print("inputs_embeds.shape:", inputs_embeds.shape) if dpo_forward: outputs = self.model( input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, ) hidden_states = outputs[0] logits = self.lm_head(hidden_states) return logits, labels else: return super().forward( input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, ) @torch.no_grad() def generate( self, inputs: Optional[torch.Tensor] = None, images: Optional[torch.Tensor] = None, image_sizes: Optional[torch.Tensor] = None, modalities: Optional[List[str]] = ["image"], **kwargs, ) -> Union[GenerateOutput, torch.LongTensor]: position_ids = kwargs.pop("position_ids", None) attention_mask = 
kwargs.pop("attention_mask", None) if "inputs_embeds" in kwargs: raise NotImplementedError("`inputs_embeds` is not supported") if images is not None: (inputs, position_ids, attention_mask, _, inputs_embeds, _) = self.prepare_inputs_labels_for_multimodal(inputs, position_ids, attention_mask, None, None, images, modalities, image_sizes=image_sizes) else: inputs_embeds = self.get_model().embed_tokens(inputs) return super().generate(position_ids=position_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs) def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs): images = kwargs.pop("images", None) image_sizes = kwargs.pop("image_sizes", None) inputs = super().prepare_inputs_for_generation(input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs) if images is not None: inputs["images"] = images if image_sizes is not None: inputs["image_sizes"] = image_sizes return inputs AutoConfig.register("llava_qwen", LlavaQwenConfig) AutoModelForCausalLM.register(LlavaQwenConfig, LlavaQwenForCausalLM) ================================================ FILE: llava-train_videochat/llava/model/language_model/llava_qwen_flash.py ================================================ # Copyright 2024 Hao Zhang # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
from typing import List, Optional, Tuple, Union, Dict import torch import torch.nn as nn from torch.nn import CrossEntropyLoss import transformers from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaModel, LlamaForCausalLM from transformers.modeling_outputs import CausalLMOutputWithPast from transformers.generation.utils import GenerateOutput # from ...constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM from transformers import Qwen2Config # from .qwen.modeling_qwen import QWenLMHeadModel, QWenModel # from .qwen.configuration_qwen import QWenConfig from .modeling_qwen2_flash import Qwen2Model_Flash, Qwen2ForCausalLM_Flash class LlavaQwenConfig_Flash(Qwen2Config): model_type = "llava_qwen_flash" class LlavaQwenModel_Flash(LlavaMetaModel, Qwen2Model_Flash): config_class = LlavaQwenConfig_Flash def __init__(self, config: Qwen2Config): super(LlavaQwenModel_Flash, self).__init__(config) class LlavaQwenForCausalLM_Flash(Qwen2ForCausalLM_Flash, LlavaMetaForCausalLM): config_class = LlavaQwenConfig_Flash def __init__(self, config): # super(Qwen2ForCausalLM, self).__init__(config) Qwen2ForCausalLM_Flash.__init__(self, config) config.model_type = "llava_qwen_flash" # config.rope_scaling = None self.model = LlavaQwenModel_Flash(config) self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) # Initialize weights and apply final processing self.post_init() def get_model(self): return self.model def forward( self, input_ids: torch.LongTensor = None, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_values: Optional[List[torch.FloatTensor]] = None, inputs_embeds: Optional[torch.FloatTensor] = None, labels: Optional[torch.LongTensor] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, images: Optional[torch.FloatTensor] = None, image_sizes: Optional[List[List[int]]] = None, return_dict: Optional[bool] = None, modalities: Optional[List[str]] = ["image"], dpo_forward: Optional[bool] = False, cache_position=None, ) -> Union[Tuple, CausalLMOutputWithPast]: if inputs_embeds is None: (input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes) # print("inputs_embeds.shape:", inputs_embeds.shape) if dpo_forward: outputs, labels = self.model( input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, labels=labels ) hidden_states = outputs[0] logits = self.lm_head(hidden_states) return logits, labels else: return super().forward( input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, ) @torch.no_grad() def generate( self, inputs: Optional[torch.Tensor] = None, images: Optional[torch.Tensor] = None, image_sizes: Optional[torch.Tensor] = None, modalities: Optional[List[str]] = ["image"], **kwargs, ) -> 
Union[GenerateOutput, torch.LongTensor]: position_ids = kwargs.pop("position_ids", None) attention_mask = kwargs.pop("attention_mask", None) if "inputs_embeds" in kwargs: raise NotImplementedError("`inputs_embeds` is not supported") if images is not None: (inputs, position_ids, attention_mask, _, inputs_embeds, _) = self.prepare_inputs_labels_for_multimodal(inputs, position_ids, attention_mask, None, None, images, modalities, image_sizes=image_sizes) else: self.model.image_token_posi = [-1] self.model.prompt_len = None self.model.image_tokens = [0] inputs_embeds = self.get_model().embed_tokens(inputs) return super().generate(position_ids=position_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs) def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs): images = kwargs.pop("images", None) image_sizes = kwargs.pop("image_sizes", None) inputs = super().prepare_inputs_for_generation(input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs) if images is not None: inputs["images"] = images if image_sizes is not None: inputs["image_sizes"] = image_sizes return inputs AutoConfig.register("llava_qwen_flash", LlavaQwenConfig_Flash) AutoModelForCausalLM.register(LlavaQwenConfig_Flash, LlavaQwenForCausalLM_Flash) ================================================ FILE: llava-train_videochat/llava/model/language_model/modeling_qwen2_flash.py ================================================ # coding=utf-8 # transformers==4.39.2 or 4.40.1 NOTE # Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved. # # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX # and OPT implementations in this library. It has been modified from its # original forms to accommodate minor architectural differences compared # to GPT-NeoX and OPT used by the Meta AI team that trained the model. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
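# --- Editorial usage sketch (not part of the original repository) ---
# generate() on LlavaQwenForCausalLM_Flash above differs from the plain variant:
# with `images`, prepare_inputs_labels_for_multimodal splices visual embeddings in
# and records where they sit; without images it resets the compression bookkeeping
# (image_token_posi=[-1], prompt_len=None, image_tokens=[0]) so the token-drop
# logic in this file becomes a no-op. Tensor shapes and the "video" modality tag
# below are illustrative assumptions:
def _demo_flash_generate(model, tokenizer, input_ids, video_frames):
    import torch

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,              # prompt ids containing the image placeholder token
            images=[video_frames],  # e.g. one [T, C, H, W] float tensor per video
            modalities=["video"],   # assumed modality tag
            do_sample=False,
            max_new_tokens=128,
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]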
""" PyTorch Qwen2 model.""" import inspect import math import warnings from typing import List, Optional, Tuple, Union import torch import torch.nn.functional as F import torch.utils.checkpoint from torch import nn from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss from transformers.activations import ACT2FN from transformers.cache_utils import Cache, DynamicCache from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask, _prepare_4d_causal_attention_mask_for_sdpa from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast from transformers.modeling_utils import PreTrainedModel from transformers.utils import ( add_start_docstrings, add_start_docstrings_to_model_forward, is_flash_attn_2_available, is_flash_attn_greater_or_equal_2_10, logging, replace_return_docstrings, ) from transformers.models.qwen2.configuration_qwen2 import Qwen2Config from llava.constants import IGNORE_INDEX if is_flash_attn_2_available(): from flash_attn import flash_attn_func, flash_attn_varlen_func from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa _flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters) logger = logging.get_logger(__name__) _CHECKPOINT_FOR_DOC = "Qwen/Qwen2-7B-beta" _CONFIG_FOR_DOC = "Qwen2Config" QWEN2_PRETRAINED_MODEL_ARCHIVE_LIST = [ "Qwen/Qwen2-7B-beta", # See all Qwen2 models at https://huggingface.co/models?filter=qwen2 ] # Copied from transformers.models.llama.modeling_llama._get_unpad_data def _get_unpad_data(attention_mask): seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32) indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten() max_seqlen_in_batch = seqlens_in_batch.max().item() cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0)) return ( indices, cu_seqlens, max_seqlen_in_batch, ) # Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Qwen2 class Qwen2RMSNorm(nn.Module): def __init__(self, hidden_size, eps=1e-6): """ Qwen2RMSNorm is equivalent to T5LayerNorm """ super().__init__() self.weight = nn.Parameter(torch.ones(hidden_size)) self.variance_epsilon = eps def forward(self, hidden_states): input_dtype = hidden_states.dtype hidden_states = hidden_states.to(torch.float32) variance = hidden_states.pow(2).mean(-1, keepdim=True) hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) return self.weight * hidden_states.to(input_dtype) # Copied from transformers.models.mistral.modeling_mistral.MistralRotaryEmbedding with Mistral->Qwen2 class Qwen2RotaryEmbedding(nn.Module): def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None): super().__init__() self.dim = dim self.max_position_embeddings = max_position_embeddings self.base = base inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim)) self.register_buffer("inv_freq", inv_freq, persistent=False) # Build here to make `torch.jit.trace` work. 
self._set_cos_sin_cache( seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype() ) def _set_cos_sin_cache(self, seq_len, device, dtype): self.max_seq_len_cached = seq_len t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq) freqs = torch.outer(t, self.inv_freq) # Different from paper, but it uses a different permutation in order to obtain the same calculation emb = torch.cat((freqs, freqs), dim=-1) self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False) self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False) def forward(self, x, seq_len=None): # x: [bs, num_attention_heads, seq_len, head_size] if seq_len > self.max_seq_len_cached: self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype) return ( self.cos_cached[:seq_len].to(dtype=x.dtype), self.sin_cached[:seq_len].to(dtype=x.dtype), ) # Copied from transformers.models.llama.modeling_llama.rotate_half def rotate_half(x): """Rotates half the hidden dims of the input.""" x1 = x[..., : x.shape[-1] // 2] x2 = x[..., x.shape[-1] // 2 :] return torch.cat((-x2, x1), dim=-1) # Copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1): """Applies Rotary Position Embedding to the query and key tensors. Args: q (`torch.Tensor`): The query tensor. k (`torch.Tensor`): The key tensor. cos (`torch.Tensor`): The cosine part of the rotary embedding. sin (`torch.Tensor`): The sine part of the rotary embedding. position_ids (`torch.Tensor`): The position indices of the tokens corresponding to the query and key tensors. For example, this can be used to pass offsetted position ids when working with a KV-cache. unsqueeze_dim (`int`, *optional*, defaults to 1): The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2. Returns: `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding. 
""" cos = cos[position_ids].unsqueeze(unsqueeze_dim) sin = sin[position_ids].unsqueeze(unsqueeze_dim) q_embed = (q * cos) + (rotate_half(q) * sin) k_embed = (k * cos) + (rotate_half(k) * sin) return q_embed, k_embed # Copied from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->Qwen2 class Qwen2MLP(nn.Module): def __init__(self, config): super().__init__() self.config = config self.hidden_size = config.hidden_size self.intermediate_size = config.intermediate_size self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) self.act_fn = ACT2FN[config.hidden_act] def forward(self, x): return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) # Copied from transformers.models.llama.modeling_llama.repeat_kv def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor: """ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim) """ batch, num_key_value_heads, slen, head_dim = hidden_states.shape if n_rep == 1: return hidden_states hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim) return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim) class Qwen2Attention(nn.Module): """ Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer and "Generating Long Sequences with Sparse Transformers". """ def __init__(self, config: Qwen2Config, layer_idx: Optional[int] = None): super().__init__() self.config = config self.layer_idx = layer_idx if layer_idx is None: logger.warning_once( f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " "when creating this class." ) self.hidden_size = config.hidden_size self.num_heads = config.num_attention_heads self.head_dim = self.hidden_size // self.num_heads self.num_key_value_heads = config.num_key_value_heads self.num_key_value_groups = self.num_heads // self.num_key_value_heads self.max_position_embeddings = config.max_position_embeddings self.rope_theta = config.rope_theta self.is_causal = True self.attention_dropout = config.attention_dropout if (self.head_dim * self.num_heads) != self.hidden_size: raise ValueError( f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" f" and `num_heads`: {self.num_heads})." 
) self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) self.rotary_emb = Qwen2RotaryEmbedding( self.head_dim, max_position_embeddings=self.max_position_embeddings, base=self.rope_theta, ) def forward( self, hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_value: Optional[Cache] = None, output_attentions: bool = False, use_cache: bool = False, **kwargs, ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: if "padding_mask" in kwargs: warnings.warn( "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ) bsz, q_len, _ = hidden_states.size() query_states = self.q_proj(hidden_states) key_states = self.k_proj(hidden_states) value_states = self.v_proj(hidden_states) query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2) value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2) kv_seq_len = key_states.shape[-2] if past_key_value is not None: if self.layer_idx is None: raise ValueError( f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " "with a layer index." ) kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) if past_key_value is not None: cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) # repeat k/v heads if n_kv_heads < n_heads key_states = repeat_kv(key_states, self.num_key_value_groups) value_states = repeat_kv(value_states, self.num_key_value_groups) attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim) if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len): raise ValueError( f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" f" {attn_weights.size()}" ) if attention_mask is not None: if attention_mask.size() != (bsz, 1, q_len, kv_seq_len): raise ValueError( f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}" ) attn_weights = attn_weights + attention_mask # upcast attention to fp32 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype) attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) attn_output = torch.matmul(attn_weights, value_states) if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim): raise ValueError( f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" f" {attn_output.size()}" ) attn_output = attn_output.transpose(1, 2).contiguous() attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) 
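# (Editor's note) heads have just been merged back into (bsz, q_len, hidden_size);
# the output projection below mixes information across heads.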
attn_output = self.o_proj(attn_output) if not output_attentions: attn_weights = None return attn_output, attn_weights, past_key_value class Qwen2FlashAttention2(Qwen2Attention): """ Qwen2 flash attention module, following Qwen2 attention module. This module inherits from `Qwen2Attention` as the weights of the module stays untouched. The only required change would be on the forward pass where it needs to correctly call the public API of flash attention and deal with padding tokens in case the input contains any of them. Additionally, for sliding window attention, we apply SWA only to the bottom config.max_window_layers layers. """ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1. # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignement, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0. # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left). self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10() def forward( self, hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_value: Optional[Cache] = None, output_attentions: bool = False, use_cache: bool = False, **kwargs, ): if "padding_mask" in kwargs: warnings.warn( "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ) # overwrite attention_mask with padding_mask attention_mask = kwargs.pop("padding_mask") bsz, q_len, _ = hidden_states.size() query_states = self.q_proj(hidden_states) key_states = self.k_proj(hidden_states) value_states = self.v_proj(hidden_states) query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2) value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2) kv_seq_len = key_states.shape[-2] if past_key_value is not None: if self.layer_idx is None: raise ValueError( f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " "with a layer index." ) kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) # Because the input can be padded, the absolute sequence length depends on the max position id. rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1 cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len) query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) use_sliding_windows = ( _flash_supports_window_size and getattr(self.config, "sliding_window", None) is not None and kv_seq_len > self.config.sliding_window and self.config.use_sliding_window ) if not _flash_supports_window_size: logger.warning_once( "The current flash attention version does not support sliding window attention, for a more memory efficient implementation" " make sure to upgrade flash-attn library." 
) if past_key_value is not None: # Activate slicing cache only if the config has a value `sliding_windows` attribute cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0 if ( getattr(self.config, "sliding_window", None) is not None and kv_seq_len > self.config.sliding_window and cache_has_contents ): slicing_tokens = 1 - self.config.sliding_window past_key = past_key_value[self.layer_idx][0] past_value = past_key_value[self.layer_idx][1] past_key = past_key[:, :, slicing_tokens:, :].contiguous() past_value = past_value[:, :, slicing_tokens:, :].contiguous() if past_key.shape[-2] != self.config.sliding_window - 1: raise ValueError( f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got" f" {past_key.shape}" ) if attention_mask is not None: attention_mask = attention_mask[:, slicing_tokens:] attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1) cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) # repeat k/v heads if n_kv_heads < n_heads key_states = repeat_kv(key_states, self.num_key_value_groups) value_states = repeat_kv(value_states, self.num_key_value_groups) dropout_rate = 0.0 if not self.training else self.attention_dropout # In PEFT, usually we cast the layer norms in float32 for training stability reasons # therefore the input hidden states gets silently casted in float32. Hence, we need # cast them back in float16 just to be sure everything works as expected. input_dtype = query_states.dtype if input_dtype == torch.float32: if torch.is_autocast_enabled(): target_dtype = torch.get_autocast_gpu_dtype() # Handle the case where the model is quantized elif hasattr(self.config, "_pre_quantization_dtype"): target_dtype = self.config._pre_quantization_dtype else: target_dtype = self.q_proj.weight.dtype logger.warning_once( f"The input hidden states seems to be silently casted in float32, this might be related to" f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in" f" {target_dtype}." ) query_states = query_states.to(target_dtype) key_states = key_states.to(target_dtype) value_states = value_states.to(target_dtype) # Reashape to the expected shape for Flash Attention query_states = query_states.transpose(1, 2) key_states = key_states.transpose(1, 2) value_states = value_states.transpose(1, 2) attn_output = self._flash_attention_forward( query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate, use_sliding_windows=use_sliding_windows, ) attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous() attn_output = self.o_proj(attn_output) if not output_attentions: attn_weights = None return attn_output, attn_weights, past_key_value def _flash_attention_forward( self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None, use_sliding_windows=False, ): """ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token first unpad the input, then computes the attention scores and pad the final attention scores. 
Args: query_states (`torch.Tensor`): Input query states to be passed to Flash Attention API key_states (`torch.Tensor`): Input key states to be passed to Flash Attention API value_states (`torch.Tensor`): Input value states to be passed to Flash Attention API attention_mask (`torch.Tensor`): The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the position of padding tokens and 1 for the position of non-padding tokens. dropout (`float`): Attention dropout softmax_scale (`float`, *optional*): The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim) use_sliding_windows (`bool`, *optional*): Whether to activate sliding window attention. """ if not self._flash_attn_uses_top_left_mask: causal = self.is_causal else: # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__. causal = self.is_causal and query_length != 1 # Decide whether to use SWA or not by layer index. if use_sliding_windows and self.layer_idx >= self.config.max_window_layers: use_sliding_windows = False # Contains at least one padding token in the sequence if attention_mask is not None: batch_size = query_states.shape[0] query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input( query_states, key_states, value_states, attention_mask, query_length ) cu_seqlens_q, cu_seqlens_k = cu_seq_lens max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens if not use_sliding_windows: attn_output_unpad = flash_attn_varlen_func( query_states, key_states, value_states, cu_seqlens_q=cu_seqlens_q, cu_seqlens_k=cu_seqlens_k, max_seqlen_q=max_seqlen_in_batch_q, max_seqlen_k=max_seqlen_in_batch_k, dropout_p=dropout, softmax_scale=softmax_scale, causal=causal, ) else: attn_output_unpad = flash_attn_varlen_func( query_states, key_states, value_states, cu_seqlens_q=cu_seqlens_q, cu_seqlens_k=cu_seqlens_k, max_seqlen_q=max_seqlen_in_batch_q, max_seqlen_k=max_seqlen_in_batch_k, dropout_p=dropout, softmax_scale=softmax_scale, causal=causal, window_size=(self.config.sliding_window, self.config.sliding_window), ) attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length) else: if not use_sliding_windows: attn_output = flash_attn_func( query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal, ) else: attn_output = flash_attn_func( query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal, window_size=(self.config.sliding_window, self.config.sliding_window), ) return attn_output # Copied from transformers.models.mistral.modeling_mistral.MistralFlashAttention2._upad_input def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length): batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape # On the first iteration we need to properly re-create the padding mask # by slicing it on the proper place if kv_seq_len != attention_mask.shape[-1]: attention_mask_num_tokens = attention_mask.shape[-1] attention_mask = attention_mask[:, attention_mask_num_tokens - kv_seq_len :] indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask) key_layer = index_first_axis(key_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k) value_layer = index_first_axis(value_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k) if query_length == kv_seq_len: query_layer = index_first_axis( 
query_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k ) cu_seqlens_q = cu_seqlens_k max_seqlen_in_batch_q = max_seqlen_in_batch_k indices_q = indices_k elif query_length == 1: max_seqlen_in_batch_q = 1 cu_seqlens_q = torch.arange( batch_size + 1, dtype=torch.int32, device=query_layer.device ) # There is a memcpy here, that is very bad. indices_q = cu_seqlens_q[:-1] query_layer = query_layer.squeeze(1) else: # The -q_len: slice assumes left padding. attention_mask = attention_mask[:, -query_length:] query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask) return ( query_layer, key_layer, value_layer, indices_q, (cu_seqlens_q, cu_seqlens_k), (max_seqlen_in_batch_q, max_seqlen_in_batch_k), ) # Copied from transformers.models.mistral.modeling_mistral.MistralSdpaAttention with Mistral->Qwen2 class Qwen2SdpaAttention(Qwen2Attention): """ Qwen2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from `Qwen2Attention` as the weights of the module stays untouched. The only changes are on the forward pass to adapt to SDPA API. """ # Adapted from Qwen2Attention.forward def forward( self, hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_value: Optional[Cache] = None, output_attentions: bool = False, use_cache: bool = False, ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: if output_attentions: # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented. logger.warning_once( "Qwen2Model is using Qwen2SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, " 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.' 
) return super().forward( hidden_states=hidden_states, attention_mask=attention_mask, position_ids=position_ids, past_key_value=past_key_value, output_attentions=output_attentions, use_cache=use_cache, ) bsz, q_len, _ = hidden_states.size() query_states = self.q_proj(hidden_states) key_states = self.k_proj(hidden_states) value_states = self.v_proj(hidden_states) query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2) value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2) kv_seq_len = key_states.shape[-2] if past_key_value is not None: kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) if past_key_value is not None: cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) key_states = repeat_kv(key_states, self.num_key_value_groups) value_states = repeat_kv(value_states, self.num_key_value_groups) if attention_mask is not None: if attention_mask.size() != (bsz, 1, q_len, kv_seq_len): raise ValueError( f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}" ) # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask, # Reference: https://github.com/pytorch/pytorch/issues/112577. if query_states.device.type == "cuda" and attention_mask is not None: query_states = query_states.contiguous() key_states = key_states.contiguous() value_states = value_states.contiguous() attn_output = torch.nn.functional.scaled_dot_product_attention( query_states, key_states, value_states, attn_mask=attention_mask, dropout_p=self.attention_dropout if self.training else 0.0, # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1. is_causal=self.is_causal and attention_mask is None and q_len > 1, ) attn_output = attn_output.transpose(1, 2).contiguous() attn_output = attn_output.view(bsz, q_len, self.hidden_size) attn_output = self.o_proj(attn_output) return attn_output, None, past_key_value QWEN2_ATTENTION_CLASSES = { "eager": Qwen2Attention, "flash_attention_2": Qwen2FlashAttention2, "sdpa": Qwen2SdpaAttention, } class Qwen2DecoderLayer(nn.Module): def __init__(self, config: Qwen2Config, layer_idx: int): super().__init__() self.hidden_size = config.hidden_size if config.use_sliding_window and config._attn_implementation != "flash_attention_2": logger.warning_once( f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; " "unexpected results may be encountered." 
) self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) self.mlp = Qwen2MLP(config) self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps) self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps) def forward( self, hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_value: Optional[Tuple[torch.Tensor]] = None, output_attentions: Optional[bool] = False, use_cache: Optional[bool] = False, **kwargs, ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]: if "padding_mask" in kwargs: warnings.warn( "Passing `padding_mask` is deprecated and will be removed in v4.37. " "Please make sure use `attention_mask` instead.`" ) """ Args: hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)` attention_mask (`torch.FloatTensor`, *optional*): attention mask of size `(batch, sequence_length)` where padding elements are indicated by 0. output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. use_cache (`bool`, *optional*): If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`). past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states """ residual = hidden_states hidden_states = self.input_layernorm(hidden_states) # Self Attention hidden_states, self_attn_weights, present_key_value = self.self_attn( hidden_states=hidden_states, attention_mask=attention_mask, position_ids=position_ids, past_key_value=past_key_value, output_attentions=output_attentions, use_cache=use_cache, ) hidden_states = residual + hidden_states # Fully Connected residual = hidden_states hidden_states = self.post_attention_layernorm(hidden_states) hidden_states = self.mlp(hidden_states) hidden_states = residual + hidden_states outputs = (hidden_states,) if output_attentions: outputs += (self_attn_weights,) if use_cache: outputs += (present_key_value,) return outputs QWEN2_START_DOCSTRING = r""" This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. Parameters: config ([`Qwen2Config`]): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. 
""" @add_start_docstrings( "The bare Qwen2 Model outputting raw hidden-states without any specific head on top.", QWEN2_START_DOCSTRING, ) class Qwen2PreTrainedModel(PreTrainedModel): config_class = Qwen2Config base_model_prefix = "model" supports_gradient_checkpointing = True _no_split_modules = ["Qwen2DecoderLayer"] _skip_keys_device_placement = "past_key_values" _supports_flash_attn_2 = True _supports_sdpa = True _supports_cache_class = True def _init_weights(self, module): std = self.config.initializer_range if isinstance(module, nn.Linear): module.weight.data.normal_(mean=0.0, std=std) if module.bias is not None: module.bias.data.zero_() elif isinstance(module, nn.Embedding): module.weight.data.normal_(mean=0.0, std=std) if module.padding_idx is not None: module.weight.data[module.padding_idx].zero_() QWEN2_INPUTS_DOCSTRING = r""" Args: input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details. [What are input IDs?](../glossary#input-ids) attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: - 1 for tokens that are **not masked**, - 0 for tokens that are **masked**. [What are attention masks?](../glossary#attention-mask) Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details. If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see `past_key_values`). If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`] and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more information on the default strategy. - 1 indicates the head is **not masked**, - 0 indicates the head is **masked**. position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.n_positions - 1]`. [What are position IDs?](../glossary#position-ids) past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*): Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Two formats are allowed: - a [`~cache_utils.Cache`] instance; - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy cache format. The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the legacy cache format will be returned. If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`. 
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix. use_cache (`bool`, *optional*): If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`). output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. output_hidden_states (`bool`, *optional*): Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail. return_dict (`bool`, *optional*): Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. """ @add_start_docstrings( "The bare Qwen2 Model outputting raw hidden-states without any specific head on top.", QWEN2_START_DOCSTRING, ) class Qwen2Model_Flash(Qwen2PreTrainedModel): """ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2DecoderLayer`] Args: config: Qwen2Config """ def __init__(self, config: Qwen2Config): super().__init__(config) self.padding_idx = config.pad_token_id self.vocab_size = config.vocab_size self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx) self.layers = nn.ModuleList( [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)] ) self._attn_implementation = config._attn_implementation self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps) self.gradient_checkpointing = False # Initialize weights and apply final processing self.post_init() def get_input_embeddings(self): return self.embed_tokens def set_input_embeddings(self, value): self.embed_tokens = value @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING) def forward( self, input_ids: torch.LongTensor = None, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_values: Optional[List[torch.FloatTensor]] = None, inputs_embeds: Optional[torch.FloatTensor] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, labels: Optional[torch.Tensor] = None, ) -> Union[Tuple, BaseModelOutputWithPast]: output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_hidden_states = ( output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states ) use_cache = use_cache if use_cache is not None else self.config.use_cache return_dict = return_dict if return_dict is not None else self.config.use_return_dict # retrieve input_ids and inputs_embeds if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time") elif input_ids is not None: batch_size, seq_length = input_ids.shape elif inputs_embeds is not None: batch_size, seq_length, _ = inputs_embeds.shape else: raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds") if self.gradient_checkpointing and self.training: if use_cache: logger.warning_once( "`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`..." ) use_cache = False past_key_values_length = 0 if use_cache: use_legacy_cache = not isinstance(past_key_values, Cache) if use_legacy_cache: past_key_values = DynamicCache.from_legacy_cache(past_key_values) past_key_values_length = past_key_values.get_usable_length(seq_length) if position_ids is None: device = input_ids.device if input_ids is not None else inputs_embeds.device position_ids = torch.arange( past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device ) position_ids = position_ids.unsqueeze(0).view(-1, seq_length) else: position_ids = position_ids.view(-1, seq_length).long() if inputs_embeds is None: inputs_embeds = self.embed_tokens(input_ids) if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache: is_padding_right = attention_mask[:, -1].sum().item() != batch_size if is_padding_right: raise ValueError( "You are attempting to perform batched generation with padding_side='right'" " this may lead to unexpected behaviour for Flash Attention version of Qwen2. Make sure to " " call `tokenizer.padding_side = 'left'` before tokenizing the input. " ) if self._attn_implementation == "flash_attention_2": # 2d mask is passed through the layers attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None elif self._attn_implementation == "sdpa" and not output_attentions: # output_attentions=True can not be supported when using SDPA, and we fall back on # the manual implementation that requires a 4D causal mask in all cases. attention_mask = _prepare_4d_causal_attention_mask_for_sdpa( attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length, ) else: # 4d mask is passed through the layers attention_mask = _prepare_4d_causal_attention_mask( attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length, sliding_window=self.config.sliding_window, ) hidden_states = inputs_embeds # decoder layers all_hidden_states = () if output_hidden_states else None all_self_attns = () if output_attentions else None next_decoder_cache = None for layer_idx, decoder_layer in enumerate(self.layers): if output_hidden_states: all_hidden_states += (hidden_states,) if self.gradient_checkpointing and self.training: layer_outputs = self._gradient_checkpointing_func( decoder_layer.__call__, hidden_states, attention_mask, position_ids, past_key_values, output_attentions, use_cache, ) else: layer_outputs = decoder_layer( hidden_states, attention_mask=attention_mask, position_ids=position_ids, past_key_value=past_key_values, output_attentions=output_attentions, use_cache=use_cache, ) hidden_states = layer_outputs[0] if use_cache: next_decoder_cache = layer_outputs[2 if output_attentions else 1] if output_attentions: all_self_attns += (layer_outputs[1],) ###### copy from pdrop ######### # rank & drop after specific layer # only drop in prefill stage when inference rank_layer = layer_idx+1 if rank_layer in self.llm_compress_layer_list: if hidden_states.shape[1] != 1: # prefill stage or training stage = self.llm_compress_layer_list.index(rank_layer) # determine current stage ( position_ids, attention_mask, hidden_states, labels # update labels and return ) = self.flash_rank_drop( cur_num = stage, rank_layer = rank_layer, features = hidden_states, position_ids=position_ids, attention_mask=attention_mask, labels = labels ) # process attention_mask again after updating if self._attn_implementation == "flash_attention_2": # 2d mask is passed through 
the layers attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None elif self._attn_implementation == "sdpa" and not output_attentions: # output_attentions=True can not be supported when using SDPA, and we fall back on # the manual implementation that requires a 4D causal mask in all cases. attention_mask = _prepare_4d_causal_attention_mask_for_sdpa( attention_mask, (batch_size, hidden_states.shape[1]), hidden_states, past_key_values_length, ) else: # 4d mask is passed through the layers attention_mask = _prepare_4d_causal_attention_mask( attention_mask, (batch_size, hidden_states.shape[1]), hidden_states, past_key_values_length, sliding_window=self.config.sliding_window, ) else: # update position_ids in the decoding stage at inference stage = self.llm_compress_layer_list.index(rank_layer) # determine current stage cur_visual_length = [int(cur_image_token * self.llm_image_token_ratio_list[stage]) for cur_image_token in self.num_image_token_lens] next_visual_length = [int(cur_image_token * self.llm_image_token_ratio_list[stage + 1]) for cur_image_token in self.num_image_token_lens] new_position_ids = [] for idx, cur_position_ids in enumerate(position_ids): cur_position_ids = cur_position_ids - (cur_visual_length[idx] - next_visual_length[idx]) new_position_ids.append(cur_position_ids) assert idx == 0, idx position_ids = torch.tensor(new_position_ids, dtype=torch.long).unsqueeze(0) ################# hidden_states = self.norm(hidden_states) # add hidden states from the last decoder layer if output_hidden_states: all_hidden_states += (hidden_states,) next_cache = None if use_cache: next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache if not return_dict: return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None), labels return BaseModelOutputWithPast( last_hidden_state=hidden_states, past_key_values=next_cache, hidden_states=all_hidden_states, attentions=all_self_attns, ), labels # implementation of the flash rank-and-drop token compression def flash_rank_drop( self, cur_num, rank_layer, features, position_ids, attention_mask, labels ): if self.llm_compress_type == 'uniform0_attention': if cur_num == 0: llm_compress_type = 'uniform' else: llm_compress_type = 'attention' elif self.llm_compress_type == 'uniform1_attention': if cur_num <= 1: llm_compress_type = 'uniform' else: llm_compress_type = 'attention' else: llm_compress_type = self.llm_compress_type _labels = labels _position_ids = position_ids _attention_mask = attention_mask if position_ids is None: position_ids = torch.arange(0, features.shape[1], dtype=torch.long, device=features.device).unsqueeze(0) if getattr(self.config, 'tokenizer_padding_side', 'right') == "right": batch_size = features.shape[0] image_tokens = [int(cur_image_token * self.llm_image_token_ratio_list[cur_num]) for cur_image_token in self.num_image_token_lens] keep_length = [int(cur_image_token * self.llm_image_token_ratio_list[cur_num + 1]) for cur_image_token in self.num_image_token_lens] features_list = [] attention_mask_list = [] labels_list = [] if attention_mask is None: attention_mask = torch.ones((batch_size,features.shape[1]), dtype=torch.bool, device=features.device) else: attention_mask = attention_mask.bool() if labels is None: labels = torch.full((batch_size,features.shape[1]), IGNORE_INDEX, device=features.device) if 'attention' in llm_compress_type: # obtain query_states and key_states to calculate attention map
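# (Editor's note) The block below is a read-only probe: it re-runs only the q/k/v
# projections of the next decoder layer (index `rank_layer`) on a detached copy of
# the hidden states, ranks the visual tokens by how strongly the last instruction
# token(s) attend to them, and keeps the top `keep_length` of them; no gradient
# flows through the probe, and the probed layer still executes normally afterwards.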
    # implementation of flash rank & drop
    def flash_rank_drop(self, cur_num, rank_layer, features, position_ids, attention_mask, labels):
        if self.llm_compress_type == 'uniform0_attention':
            llm_compress_type = 'uniform' if cur_num == 0 else 'attention'
        elif self.llm_compress_type == 'uniform1_attention':
            llm_compress_type = 'uniform' if cur_num <= 1 else 'attention'
        else:
            llm_compress_type = self.llm_compress_type

        _labels = labels
        _position_ids = position_ids
        _attention_mask = attention_mask

        if position_ids is None:
            position_ids = torch.arange(0, features.shape[1], dtype=torch.long, device=features.device).unsqueeze(0)

        if getattr(self.config, 'tokenizer_padding_side', 'right') == "right":
            batch_size = features.shape[0]
            image_tokens = [int(cur_image_token * self.llm_image_token_ratio_list[cur_num]) for cur_image_token in self.num_image_token_lens]
            keep_length = [int(cur_image_token * self.llm_image_token_ratio_list[cur_num + 1]) for cur_image_token in self.num_image_token_lens]

            features_list = []
            attention_mask_list = []
            labels_list = []

            if attention_mask is None:
                attention_mask = torch.ones((batch_size, features.shape[1]), dtype=torch.bool, device=features.device)
            else:
                attention_mask = attention_mask.bool()
            if labels is None:
                labels = torch.full((batch_size, features.shape[1]), IGNORE_INDEX, device=features.device)

            if 'attention' in llm_compress_type:
                # obtain query_states and key_states to calculate the attention map;
                # note rank_layer = layer_idx + 1, i.e. the *next* layer's projections are used
                hidden_states = features.clone().detach()
                self_attn = self.layers[rank_layer].self_attn
                hidden_states = self.layers[rank_layer].input_layernorm(hidden_states)

                num_heads = self_attn.num_heads
                num_key_value_heads = self_attn.num_key_value_heads
                head_dim = self_attn.head_dim

                bsz, q_len, _ = hidden_states.size()

                query_states = self_attn.q_proj(hidden_states)
                key_states = self_attn.k_proj(hidden_states)
                value_states = self_attn.v_proj(hidden_states)

                query_states = query_states.view(bsz, q_len, num_heads, head_dim).transpose(1, 2)
                key_states = key_states.view(bsz, q_len, num_key_value_heads, head_dim).transpose(1, 2)
                value_states = value_states.view(bsz, q_len, num_key_value_heads, head_dim).transpose(1, 2)

                kv_seq_len = key_states.shape[-2]
                cos, sin = self_attn.rotary_emb(value_states, seq_len=kv_seq_len)
                query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
                key_states = repeat_kv(key_states, self_attn.num_key_value_groups)

                # attention_mask
                eager_attention_mask = _prepare_4d_causal_attention_mask(
                    attention_mask, (batch_size, q_len), hidden_states, past_key_values_length=0
                ).to(device=query_states.device)

            # take valid features
            features = [cur_features[cur_attention_mask] for cur_features, cur_attention_mask in zip(features, attention_mask)]
            labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]
            attention_mask = [cur_attention_mask[cur_attention_mask] for cur_attention_mask in attention_mask]

            # rank & drop
            for i in range(batch_size):
                image_index = self.first_image_token_position[i]
                if image_index == -1:
                    # no visual tokens in this sample: keep it unchanged
                    cur_input_embeds = features[i]
                    features_list.append(cur_input_embeds)
                    attention_mask_list.append(attention_mask[i])
                    labels_list.append(labels[i])
                    continue

                if 'attention' in llm_compress_type:
                    # obtain current states
                    cur_key_states = key_states[i]
                    cur_query_states = query_states[i]
                    cur_eager_attention_mask = eager_attention_mask[i]

                    # choose the last instruction token as the query
                    if self.training:
                        answer_index = torch.where(labels[i] != -100)[0].tolist()
                        index_before_answer = []
                        for index in answer_index:
                            if labels[i][index - 1] == -100:
                                index_before_answer.append(index - 1)
                        if index_before_answer == []:
                            cur_input_embeds = features[i]
                            features_list.append(cur_input_embeds)
                            attention_mask_list.append(attention_mask[i])
                            labels_list.append(labels[i])
                            continue
                        index_before_answer = torch.tensor(index_before_answer, device=labels[0].device)
                        text_query_states = cur_query_states[:, index_before_answer, :]
                        text_eager_attention_mask = cur_eager_attention_mask[:, index_before_answer, :]
                    else:
                        prompt_total_len = self.text_prompt_lens[i] + image_tokens[i]
                        text_query_states = cur_query_states[:, prompt_total_len - 1, :].unsqueeze(1)
                        text_eager_attention_mask = cur_eager_attention_mask[:, prompt_total_len - 1, :].unsqueeze(1)

                    # calculate the attention map
                    attn_weights = torch.matmul(text_query_states, cur_key_states.transpose(1, 2)) / math.sqrt(head_dim)  # (num_head, text_token, seq_len)
                    attn_weights = attn_weights + text_eager_attention_mask
                    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)  # (num_head, text_token, seq_len)
                    attention_avg_head = torch.mean(attn_weights, dim=0)  # average across heads
                    attention_avg_head = attention_avg_head[:, image_index:image_index + image_tokens[i]]  # select image tokens as keys
                    attention_avg_text = torch.mean(attention_avg_head, dim=0)  # (num_image_tokens,)

                    if llm_compress_type == 'attention':
                        top_rank_index = attention_avg_text.topk(keep_length[i]).indices
                    else:
                        raise NotImplementedError(llm_compress_type)
                elif llm_compress_type == 'uniform':
                    top_rank_index = torch.linspace(0, image_tokens[i] - 1, keep_length[i], dtype=torch.long)
                else:
                    raise NotImplementedError(llm_compress_type)

                top_rank_index = top_rank_index + image_index
                top_rank_index = top_rank_index.sort().values

                start_index = image_index + image_tokens[i]
                new_input_embeds = torch.cat([features[i][:image_index, :], features[i][top_rank_index, :], features[i][start_index:, :]], dim=0)
                new_labels = torch.cat([labels[i][:image_index], labels[i][top_rank_index], labels[i][start_index:]], dim=0)
                new_attention_mask = torch.cat([attention_mask[i][:image_index], attention_mask[i][top_rank_index], attention_mask[i][start_index:]], dim=0)

                features_list.append(new_input_embeds)
                attention_mask_list.append(new_attention_mask)
                labels_list.append(new_labels)

            # Truncate sequences to max length as image embeddings can make the sequence longer
            tokenizer_model_max_length = getattr(self.config, 'tokenizer_model_max_length', None)
            if tokenizer_model_max_length is not None:
                new_input_embeds = [x[:tokenizer_model_max_length] for x in features_list]
                new_attention_mask = [x[:tokenizer_model_max_length] for x in attention_mask_list]
                new_labels = [x[:tokenizer_model_max_length] for x in labels_list]
            else:
                # no max length configured: keep the un-truncated lists
                new_input_embeds = features_list
                new_attention_mask = attention_mask_list
                new_labels = labels_list

            max_len = max(x.shape[0] for x in new_input_embeds)

            # pad the sequences to form a batch
            embeds_padded = []
            labels_padded = []
            attention_mask_padded = []
            position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device)

            for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
                cur_len_emb = cur_new_embed.shape[0]
                dif = max_len - cur_len_emb  # pad to the longest sequence
                cur_new_embed = torch.cat([cur_new_embed, torch.zeros((dif, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)], dim=0)
                cur_new_labels = torch.cat([cur_new_labels, torch.full((dif,), IGNORE_INDEX, dtype=cur_new_labels.dtype, device=cur_new_labels.device)], dim=0)
                cur_attention_mask = new_attention_mask[i]
                cur_attention_mask = torch.cat([cur_attention_mask, torch.full((dif,), False, dtype=cur_attention_mask.dtype, device=cur_attention_mask.device)], dim=0)

                embeds_padded.append(cur_new_embed)
                labels_padded.append(cur_new_labels)
                attention_mask_padded.append(cur_attention_mask)

                cur_len = new_attention_mask[i].sum().item()
                position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)

            new_input_embeds = torch.stack(embeds_padded, dim=0)
            new_input_embeds = new_input_embeds.to(features[0].dtype)
            new_attention_mask = torch.stack(attention_mask_padded, dim=0)
            new_labels = torch.stack(labels_padded, dim=0)

            if _position_ids is None:
                position_ids = None
            if _labels is None:
                new_labels = None
            if _attention_mask is None:
                new_attention_mask = None
            else:
                new_attention_mask = new_attention_mask.to(dtype=_attention_mask.dtype)

            return position_ids, new_attention_mask, new_input_embeds, new_labels
        else:
            raise ValueError(f"Unexpected tokenizer_padding_side: {self.config.tokenizer_padding_side}")
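    # NOTE (editor): the core of the attention-based ranking above, reduced to a
    # hedged, self-contained sketch (all names and sizes here are hypothetical):
    #
    #   import math, torch
    #   head_dim, seq_len, n_img, keep = 64, 512, 256, 64
    #   q = torch.randn(1, head_dim)                 # last instruction token
    #   k = torch.randn(seq_len, head_dim)           # all token keys
    #   scores = (q @ k.T).div(math.sqrt(head_dim)).softmax(-1).squeeze(0)
    #   img_scores = scores[10:10 + n_img]           # image span starts at index 10
    #   top = img_scores.topk(keep).indices.sort().values + 10
    #
    # `top` plays the role of `top_rank_index` in flash_rank_drop.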

class Qwen2ForCausalLM_Flash(Qwen2PreTrainedModel):
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        self.model = Qwen2Model_Flash(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        self.model = decoder

    def get_decoder(self):
        return self.model

    @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

        Returns:

        Example:

        ```python
        >>> from transformers import AutoTokenizer, Qwen2ForCausalLM

        >>> model = Qwen2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
        >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")

        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```"""
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs, labels = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            labels=labels,
        )

        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states)
        logits = logits.float()

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs):
        # Omit tokens covered by past_key_values
        if past_key_values is not None:
            if isinstance(past_key_values, Cache):
                cache_length = past_key_values.get_seq_length()
                past_length = past_key_values.seen_tokens
                max_cache_length = past_key_values.get_max_length()
            else:
                cache_length = past_length = past_key_values[0][0].shape[2]
                max_cache_length = None

            # Keep only the unprocessed tokens:
            # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
            # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
            # input)
            if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
                input_ids = input_ids[:, -(attention_mask.shape[1] - past_length):]
            # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
            # input_ids based on the past_length.
            elif past_length < input_ids.shape[1]:
                input_ids = input_ids[:, past_length:]
            # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.

            # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
            if (
                max_cache_length is not None
                and attention_mask is not None
                and cache_length + input_ids.shape[1] > max_cache_length
            ):
                attention_mask = attention_mask[:, -max_cache_length:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1]:]

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

    @staticmethod
    def _reorder_cache(past_key_values, beam_idx):
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
            )
        return reordered_past
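
# NOTE (editor): a hedged usage sketch. The compression attributes are normally
# set on the backbone by LlavaMetaForCausalLM.prepare_inputs_labels_for_multimodal
# (see llava_arch.py below); driving Qwen2Model_Flash directly would require
# setting them by hand, roughly like this (the per-sample values are illustrative):
#
#   model = Qwen2ForCausalLM_Flash(config)
#   model.model.llm_compress_type = "attention"
#   model.model.llm_compress_layer_list = [8, 16, 24]
#   model.model.llm_image_token_ratio_list = [1.0, 0.5, 0.25, 0.125]
#   model.model.first_image_token_position = [14]   # per sample
#   model.model.num_image_token_lens = [3072]       # visual tokens per sample
#   model.model.text_prompt_lens = [42]             # inference only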

================================================
FILE: llava-train_videochat/llava/model/llava_arch.py
================================================
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from abc import ABC, abstractmethod

import psutil
import math
import re
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

from .multimodal_encoder.builder import build_vision_tower
from .multimodal_projector.builder import build_vision_projector

from llava.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.mm_utils import get_anyres_image_grid_shape
from llava.utils import rank0_print
import random


class LlavaMetaModel:

    def __init__(self, config):
        super(LlavaMetaModel, self).__init__(config)

        if hasattr(config, "mm_vision_tower"):
            delay_load = getattr(config, "delay_load", False)
            self.vision_tower = build_vision_tower(config, delay_load=delay_load)
            self.mm_projector = build_vision_projector(config, vision_cfg=self.vision_tower.config)

            if "unpad" in getattr(config, "mm_patch_merge_type", ""):
                self.image_newline = nn.Parameter(torch.empty(config.hidden_size, dtype=self.dtype))
            if "nopad" in getattr(config, "mm_patch_merge_type", "") and getattr(self.config, "mm_newline_position", "nothing") != "nothing":
                self.frame_newline = nn.Parameter(torch.empty(config.hidden_size, dtype=self.dtype))

    def get_vision_tower(self):
        vision_tower = getattr(self, "vision_tower", None)
        if type(vision_tower) is list:
            vision_tower = vision_tower[0]
        return vision_tower

    def initialize_vision_modules(self, model_args, fsdp=None):
        vision_tower = model_args.vision_tower
        mm_vision_select_layer = model_args.mm_vision_select_layer
        mm_vision_select_feature = model_args.mm_vision_select_feature
        pretrain_mm_mlp_adapter = model_args.pretrain_mm_mlp_adapter
        mm_patch_merge_type = model_args.mm_patch_merge_type

        self.config.mm_vision_tower = vision_tower
        self.config.vision_tower_pretrained = getattr(model_args, "vision_tower_pretrained", "")

        if self.get_vision_tower() is None:
            vision_tower = build_vision_tower(model_args)

            if fsdp is not None and len(fsdp) > 0:
                self.vision_tower = [vision_tower]
            else:
                self.vision_tower = vision_tower
        else:
            if fsdp is not None and len(fsdp) > 0:
                vision_tower = self.vision_tower[0]
            else:
                vision_tower = self.vision_tower
            vision_tower.load_model()

        self.config.use_mm_proj = True
        self.config.mm_projector_type = getattr(model_args, "mm_projector_type", "linear")
        self.config.mm_hidden_size = vision_tower.hidden_size
        self.config.mm_vision_select_layer = mm_vision_select_layer
        self.config.mm_vision_select_feature = mm_vision_select_feature
        self.config.mm_patch_merge_type = mm_patch_merge_type

        if getattr(self, "mm_projector", None) is None:
            self.mm_projector = build_vision_projector(self.config, vision_cfg=vision_tower.config)

            if "unpad" in mm_patch_merge_type:
                embed_std = 1 / torch.sqrt(torch.tensor(self.config.hidden_size, dtype=self.dtype))
                self.image_newline = nn.Parameter(torch.randn(self.config.hidden_size, dtype=self.dtype) * embed_std)
            if "nopad" in getattr(self.config, "mm_patch_merge_type", "") and getattr(self.config, "mm_newline_position", "nothing") != "nothing":
                embed_std = 1 / torch.sqrt(torch.tensor(self.config.hidden_size, dtype=self.dtype))
                self.frame_newline = nn.Parameter(torch.randn(self.config.hidden_size, dtype=self.dtype) * embed_std)
        else:
            # In case it is frozen by LoRA
            for p in self.mm_projector.parameters():
                p.requires_grad = True

        if pretrain_mm_mlp_adapter is not None:
            mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location="cpu")

            def get_w(weights, keyword):
                return {k.split(keyword + ".")[1]: v for k, v in weights.items() if keyword in k}

            if self.config.mm_projector_type == 'lxh_qformer':
                incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"), strict=False)
            else:
                incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"))
            rank0_print(f"Loaded mm projector weights from {pretrain_mm_mlp_adapter}. Incompatible keys: {incompatible_keys}")
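
# NOTE (editor): `get_w` above strips a projector prefix from checkpoint keys.
# A hedged illustration with a toy state dict:
#
#   weights = {"model.mm_projector.0.weight": 1, "model.layers.0.x": 2}
#   get_w(weights, "mm_projector")  # -> {"0.weight": 1}
#
# i.e. only keys containing "mm_projector" survive, renamed to the part after
# "mm_projector.", which is what mm_projector.load_state_dict expects.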

def unpad_image(tensor, original_size, is_frame=False):
    """
    Unpads a PyTorch tensor of a padded and resized image.

    Args:
        tensor (torch.Tensor): The image tensor, assumed to be in CxHxW format
            (or FxCxHxW when `is_frame` is True).
        original_size (tuple): The original size of the image as (width, height).
        is_frame (bool): Whether `tensor` carries a leading frame dimension.

    Returns:
        torch.Tensor: The unpadded image tensor.
    """
    original_width, original_height = original_size
    if is_frame:
        current_height, current_width = tensor.shape[2:]
    else:
        current_height, current_width = tensor.shape[1:]

    # Compute aspect ratios
    original_aspect_ratio = original_width / original_height
    current_aspect_ratio = current_width / current_height

    # Determine padding size and direction
    if original_aspect_ratio > current_aspect_ratio:
        # Padding was added to the height
        scale_factor = current_width / original_width
        new_height = int(original_height * scale_factor)
        padding = (current_height - new_height) // 2
        if is_frame:
            unpadded_tensor = tensor[:, :, padding : current_height - padding, :]
        else:
            unpadded_tensor = tensor[:, padding : current_height - padding, :]
    else:
        # Padding was added to the width
        scale_factor = current_height / original_height
        new_width = int(original_width * scale_factor)
        padding = (current_width - new_width) // 2
        if is_frame:
            unpadded_tensor = tensor[:, :, :, padding : current_width - padding]
        else:
            unpadded_tensor = tensor[:, :, padding : current_width - padding]

    return unpadded_tensor
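
# NOTE (editor): a worked example of unpad_image. Suppose a 640x480 (WxH) image
# was letterboxed into a square 336x336 feature map of shape (C, 336, 336):
#   original_aspect_ratio = 640/480 ~ 1.33 > current_aspect_ratio = 1.0,
#   so padding was added to the height: scale_factor = 336/640 = 0.525,
#   new_height = int(480 * 0.525) = 252, padding = (336 - 252) // 2 = 42,
#   and the result is tensor[:, 42:336-42, :] with shape (C, 252, 336).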

class LlavaMetaForCausalLM(ABC):

    @abstractmethod
    def get_model(self):
        pass

    def get_vision_tower(self):
        return self.get_model().get_vision_tower()

    def get_4dPool(self, image_feature):
        num_frames, num_tokens, num_dim = image_feature.shape
        height = width = int(math.sqrt(num_tokens))
        assert num_tokens == height * width, image_feature.shape
        image_feature = image_feature.view(num_frames, height, width, -1)
        image_feature = image_feature.permute(0, 3, 1, 2).contiguous()
        # image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
        if self.config.mm_spatial_pool_mode == "average":
            raise NotImplementedError
            image_feature = nn.functional.avg_pool2d(image_feature, self.config.mm_spatial_pool_stride)
        elif self.config.mm_spatial_pool_mode == "max":
            raise NotImplementedError
            image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
        elif self.config.mm_spatial_pool_mode == "bilinear":
            height, weight = image_feature.shape[2:]
            scaled_shape = [math.ceil(height / 4), math.ceil(weight / 4)]
            image_feature = nn.functional.interpolate(image_feature, size=scaled_shape, mode='bilinear')
        else:
            raise ValueError(f"Unexpected mm_spatial_pool_mode: {self.config.mm_spatial_pool_mode}")
        image_feature = image_feature.permute(0, 2, 3, 1)
        image_feature = image_feature.view(num_frames, -1, num_dim)
        return image_feature

    def get_2dPool(self, image_feature):
        num_frames, num_tokens, num_dim = image_feature.shape
        height = width = int(math.sqrt(num_tokens))
        assert num_tokens == height * width, image_feature.shape
        image_feature = image_feature.view(num_frames, height, width, -1)
        image_feature = image_feature.permute(0, 3, 1, 2).contiguous()
        # image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
        if self.config.mm_spatial_pool_mode == "average":
            raise NotImplementedError
            image_feature = nn.functional.avg_pool2d(image_feature, self.config.mm_spatial_pool_stride)
        elif self.config.mm_spatial_pool_mode == "max":
            raise NotImplementedError
            image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
        elif self.config.mm_spatial_pool_mode == "bilinear":
            height, weight = image_feature.shape[2:]
            scaled_shape = [math.ceil(height / 2), math.ceil(weight / 2)]
            image_feature = nn.functional.interpolate(image_feature, size=scaled_shape, mode='bilinear')
        else:
            raise ValueError(f"Unexpected mm_spatial_pool_mode: {self.config.mm_spatial_pool_mode}")
        image_feature = image_feature.permute(0, 2, 3, 1)
        image_feature = image_feature.view(num_frames, -1, num_dim)
        return image_feature
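    # NOTE (editor): shape math for the pooling helpers above, assuming a square
    # token grid. For 576 tokens per frame (24x24), the "bilinear" branch of
    # get_2dPool interpolates to ceil(24/2) = 12 per side, i.e. 144 tokens per
    # frame; get_4dPool is identical except it divides by 4 (24 -> 6, 36 tokens).
    # The "average" and "max" branches raise NotImplementedError before their
    # (unreachable) pooling calls, so only "bilinear" is usable as written.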
    def encode_image(self, images_list):
        concat_images = torch.cat([image for image in images_list], dim=0)
        split_sizes = [image.shape[0] for image in images_list]
        image_features = self.get_model().get_vision_tower()(concat_images)
        image_features = self.get_model().mm_projector(image_features)
        image_features = torch.split(image_features, split_sizes)
        return image_features

    def encode_image_video(self, images_list, video_idx_in_batch):
        concat_images = torch.cat([image for image in images_list], dim=0)
        split_sizes = [image.shape[0] for image in images_list]
        videos_or_images_features = self.get_model().get_vision_tower()(concat_images)
        per_videos_or_images_features = torch.split(videos_or_images_features, split_sizes, dim=0)  # tuple, (dim_1, 576, 4096)
        all_videos_or_images_features = []
        for idx, feat in enumerate(per_videos_or_images_features):
            if idx in video_idx_in_batch:
                feat = self.get_model().mm_projector(feat, compress=True, local_num_frames=getattr(self.config, "mm_local_num_frames", -1))
            else:
                feat = self.get_model().mm_projector(feat, compress=False)
            all_videos_or_images_features.append(feat)
        return all_videos_or_images_features

    def encode_video(self, images_list, video_idx_in_batch):
        bs = len(images_list)
        concat_images = []
        concat_videos = []
        for idx, image in enumerate(images_list):
            if idx in video_idx_in_batch:
                concat_videos.append(image)
            else:
                concat_images.append(image)
        # print(concat_videos[0].shape)
        has_image = len(concat_images) > 0
        has_video = len(concat_videos) > 0

        mm_local_num_frames = getattr(self.config, "mm_local_num_frames", -1)
        assert mm_local_num_frames != -1

        if has_image:
            image_split_sizes = [image.shape[0] for image in concat_images]
            concat_images = torch.cat([image.unsqueeze(1) for image in concat_images], dim=0)
            images_features = self.get_model().get_vision_tower()(concat_images)  # B_i, N, D
            images_features = self.get_model().mm_projector(images_features, compress=False, local_num_frames=1)
            images_features = torch.split(images_features, image_split_sizes)

        if has_video:
            video_split_sizes = [video.shape[0] // mm_local_num_frames for video in concat_videos]
            concat_videos = torch.cat([video.reshape(video.shape[0] // mm_local_num_frames, mm_local_num_frames, video.shape[1], video.shape[2], video.shape[3]) for video in concat_videos], dim=0)  # B T C H W
            videos_features = self.get_model().get_vision_tower()(concat_videos)  # B_v, N, D
            videos_features = self.get_model().mm_projector(videos_features, compress=True, local_num_frames=mm_local_num_frames)
            videos_features = [v.reshape(-1, v.shape[-2] // mm_local_num_frames, v.shape[-1]) for v in torch.split(videos_features, video_split_sizes)]

        all_videos_or_images_features = []
        img_idx = 0
        vid_idx = 0

        for idx in range(bs):
            if idx in video_idx_in_batch:
                feat = videos_features[vid_idx]
                vid_idx += 1
            else:
                feat = images_features[img_idx]
                img_idx += 1
            all_videos_or_images_features.append(feat)

        if has_video:
            assert vid_idx == len(videos_features), f"vid: {vid_idx} != {len(videos_features)}"
        if has_image:
            assert img_idx == len(images_features), f"img: {img_idx} != {len(images_features)}"

        return all_videos_or_images_features

    def encode_video_image(self, images_list, video_idx_in_batch):
        bs = len(images_list)
        concat_images = []
        concat_videos = []
        for idx, image in enumerate(images_list):
            if idx in video_idx_in_batch:
                concat_videos.append(image)
            else:
                concat_images.append(image)
        # print(concat_videos[0].shape)
        has_image = len(concat_images) > 0
        has_video = len(concat_videos) > 0

        mm_local_num_frames = getattr(self.config, "mm_local_num_frames", -1)
        assert mm_local_num_frames != -1

        if has_image:
            image_split_sizes = [image.shape[0] for image in concat_images]
            concat_images = torch.cat([image.unsqueeze(1) for image in concat_images], dim=0)
            # print("input vit image.shape:", concat_images.shape)
            images_features = self.get_model().get_vision_tower()(concat_images)  # B_i, N, D
            images_features = torch.split(images_features, image_split_sizes)

        if has_video:
            video_split_sizes = [video.shape[0] // mm_local_num_frames for video in concat_videos]
            concat_videos = torch.cat([video.reshape(video.shape[0] // mm_local_num_frames, mm_local_num_frames, video.shape[1], video.shape[2], video.shape[3]) for video in concat_videos], dim=0)
            # print("input vit video.shape:", concat_videos.shape)
            videos_features = self.get_model().get_vision_tower()(concat_videos)  # B_v, N, D
            videos_features = [v.reshape(-1, v.shape[-2] // mm_local_num_frames, v.shape[-1]) for v in torch.split(videos_features, video_split_sizes)]

        all_videos_or_images_features = []
        img_idx = 0
        vid_idx = 0

        for idx in range(bs):
            if idx in video_idx_in_batch:
                feat = self.get_model().mm_projector(videos_features[vid_idx], compress=True, local_num_frames=getattr(self.config, "mm_local_num_frames", -1))
                vid_idx += 1
            else:
                feat = self.get_model().mm_projector(images_features[img_idx], compress=False)
                img_idx += 1
            all_videos_or_images_features.append(feat)

        if has_video:
            assert vid_idx == len(videos_features), f"vid: {vid_idx} != {len(videos_features)}"
        if has_image:
            assert img_idx == len(images_features), f"img: {img_idx} != {len(images_features)}"

        return all_videos_or_images_features

    def add_token_per_frame(self, image_feature):
        image_feature = image_feature.permute(2, 0, 1).contiguous()
        if hasattr(self.model, "frame_newline"):
            image_feature = torch.cat((image_feature, self.model.frame_newline[:, None, None].expand(*image_feature.shape[:-1], 1).to(image_feature.device)), dim=-1)
        else:
            image_feature = torch.cat((image_feature, self.model.image_newline[:, None, None].expand(*image_feature.shape[:-1], 1).to(image_feature.device)), dim=-1)
        image_feature = image_feature.permute(1, 2, 0).contiguous()
        return image_feature

    def add_different_token_per_frame(self, image_feature):
        raise NotImplementedError("No")
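    # NOTE (editor): prepare_inputs_labels_for_multimodal below does, in order:
    #   1. encode images/videos with the vision tower + projector (dispatched on
    #      vision_encode_type);
    #   2. splice each <image> placeholder out of input_ids and insert the
    #      corresponding visual features into the text embeddings, setting the
    #      labels of visual positions to IGNORE_INDEX;
    #   3. truncate to tokenizer_model_max_length, pad the batch to its longest
    #      sequence, and rebuild attention_mask / position_ids accordingly.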
    def prepare_inputs_labels_for_multimodal(self, input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities=["image"], image_sizes=None):
        assert type(modalities) is list, modalities
        vision_tower = self.get_vision_tower()
        # rank_print(modalities)
        if vision_tower is None or images is None or input_ids.shape[1] == 1:
            return input_ids, position_ids, attention_mask, past_key_values, None, labels

        if type(images) is list or images.ndim == 5:
            if type(images) is list:
                images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]

            video_idx_in_batch = []
            for _ in range(len(modalities)):
                if modalities[_] == "video":
                    video_idx_in_batch.append(_)

            images_list = []
            for image in images:
                if image.ndim == 4:
                    images_list.append(image)
                else:
                    images_list.append(image.unsqueeze(0))

            vision_encode_type = getattr(self.config, "vision_encode_type", "image")
            mm_patch_merge_type = getattr(self.config, "mm_patch_merge_type", "flat")
            image_aspect_ratio = getattr(self.config, "image_aspect_ratio", "square")
            frame_aspect_ratio = getattr(self.config, "frame_aspect_ratio", "square")
            mm_newline_position = getattr(self.config, "mm_newline_position", "nothing")

            if "anyres" in frame_aspect_ratio:
                new_images_list = []
                num_frames_list = []
                for idx, image in enumerate(images_list):
                    if idx in video_idx_in_batch:
                        T, C, H, W = image.shape
                        num_frames_list.append(T)
                        # print("origin video.shape:", image.shape)  # T C H W
                        patch_size = self.get_vision_tower().image_size
                        if H * W != patch_size * patch_size:
                            global_image = F.interpolate(
                                image.float(),
                                size=(patch_size, patch_size),
                                mode='bicubic',
                                align_corners=False
                            ).to(image.dtype).unsqueeze(0)
                            sub_image = image.reshape(
                                1, T, C, H // patch_size, patch_size, W // patch_size, patch_size
                            ).permute(0, 3, 5, 1, 2, 4, 6).reshape(-1, T, C, patch_size, patch_size).contiguous()
                            new_image = torch.concat([global_image, sub_image], dim=0).flatten(0, 1)
                        else:
                            new_image = image
                        # print("new video shape:", new_image.shape)
                        new_images_list.append(new_image)
                    else:
                        num_frames_list.append(1)
                        new_images_list.append(image)
                images_list = new_images_list

            # rank0_print(self.config)
            # TODO image: share vit&connector for image/video, image_video:, video
            if vision_encode_type == "image":  # image backbone, process video frame by frame
                image_features = self.encode_image(images_list)
            elif vision_encode_type == "video":  # video backbone, process video with compression
                image_features = self.encode_video(images_list, video_idx_in_batch=video_idx_in_batch)
            elif vision_encode_type == "image_video":  # image backbone, process video with compression
                image_features = self.encode_image_video(images_list, video_idx_in_batch=video_idx_in_batch)
            elif vision_encode_type == "image_video_new":
                image_features = self.encode_image_video_new(images_list, video_idx_in_batch=video_idx_in_batch)
            elif vision_encode_type == "video_image":  # image backbone, process video with compression
                image_features = self.encode_video_image(images_list, video_idx_in_batch=video_idx_in_batch)
            else:
                raise NotImplementedError(vision_encode_type)

            if 'llava_ov' in getattr(self.config, "transformers_version", ""):
                new_image_features = []
                # print("I am llava ov!!!!!!!")
                for idx, image_feat in enumerate(image_features):
                    if idx in video_idx_in_batch:
                        # NOTE lxh: I don't want it.
                        new_image_features.append(self.get_2dPool(image_feat))
                    else:
                        new_image_features.append(image_feat)
                image_features = new_image_features
            if mm_patch_merge_type == "flat":
                image_features = [x.flatten(0, 1) for x in image_features]
            elif mm_patch_merge_type.startswith("spatial"):
                new_image_features = []
                for image_idx, image_feature in enumerate(image_features):
                    # FIXME: now assume the image is square, and split to 2x2 patches
                    # num_patches = h * w, where h = w = sqrt(num_patches)
                    # currently image_feature is a tensor of shape (4, num_patches, hidden_size)
                    # we want to first unflatten it to (2, 2, h, w, hidden_size)
                    # rank0_print("At least we are reaching here")
                    if image_idx in video_idx_in_batch:  # video operations
                        # rank0_print("Video")
                        # rank0_print(f"video image_feature.shape: {image_feature.shape}")
                        if "anyres" in frame_aspect_ratio:
                            if "anyres_max" in frame_aspect_ratio:
                                matched_anyres_max_num_patches = re.match(r"anyres_max_(\d+)", frame_aspect_ratio)
                                if matched_anyres_max_num_patches:
                                    max_num_patches = int(matched_anyres_max_num_patches.group(1))

                            num_frames = num_frames_list[image_idx]
                            if hasattr(self.get_vision_tower(), "image_size"):
                                vision_tower_image_size = self.get_vision_tower().image_size
                            else:
                                raise ValueError("vision_tower_image_size is not found in the vision tower.")
                            try:
                                num_patch_width, num_patch_height = get_anyres_image_grid_shape(image_sizes[image_idx], self.config.frame_grid_pinpoints, vision_tower_image_size, max_resolutions=self.config.max_num_pixels // num_frames)  # TODO: num_frames needs to be passed in for this computation
                            except Exception as e:
                                rank0_print(f"Error: {e}, self.config:{self.config}")
                                raise e
                            height = width = self.get_model().mm_projector.num_frame_patches_per_side

                            if "maxpool2x2" in mm_patch_merge_type:
                                raise NotImplementedError
                            elif "unpad" in mm_patch_merge_type and "anyres_max" in frame_aspect_ratio and matched_anyres_max_num_patches:
                                raise NotImplementedError
                            elif "unpad" in mm_patch_merge_type and "anyres" in frame_aspect_ratio:
                                raise NotImplementedError
                            else:
                                # rank0_print(f"652 num_frames={num_frames}")
                                if num_patch_height * num_patch_width != 1:  # has global view
                                    image_feature = image_feature.view(num_patch_height * num_patch_width + 1, -1, height, width, image_feature.shape[-1])
                                    assert num_frames == image_feature.shape[1], f"{num_frames} != {image_feature.shape[1]}"
                                    base_frame_feature = image_feature[0].view(num_frames, -1, image_feature[0].shape[-1])  # T 4*4 C
                                    # rank0_print(f"base_frame_feature.shape: {base_frame_feature.shape}")
                                    image_feature = image_feature[1:].permute(1, 0, 2, 3, 4)  # T P 4 4 C
                                    frame_feature = image_feature.view(num_frames, num_patch_height, num_patch_width, height, width, -1)
                                    frame_feature = frame_feature.permute(0, 1, 3, 2, 4, 5).contiguous()
                                    frame_feature = frame_feature.flatten(1, 4)
                                    frame_feature = torch.cat((base_frame_feature, frame_feature), dim=1)
                                    # rank0_print(f"two_frame_feature.shape: {frame_feature.shape}")
                                else:  # no global view
                                    frame_feature = image_feature.view(num_frames, -1, image_feature[0].shape[-1])  # T 4*4 C
                                    # rank0_print(f"only_frame_feature.shape: {frame_feature.shape}")

                            if "nobase" in mm_patch_merge_type:
                                raise NotImplementedError
                        else:
                            frame_feature = image_feature

                        if "pad" in mm_patch_merge_type:  # covers both "unpad" and "nopad"
                            if mm_newline_position == 'one_token':
                                frame_feature = frame_feature.flatten(0, 1)
                                if "unpad" in mm_patch_merge_type:
                                    frame_feature = torch.cat((frame_feature, self.model.image_newline[None].to(frame_feature.device)), dim=0)
                                else:
                                    frame_feature = torch.cat((frame_feature, self.model.frame_newline[None].to(frame_feature.device)), dim=0)
                            elif mm_newline_position == 'frame':
                                # Frame-wise
                                frame_feature = self.add_token_per_frame(frame_feature)
                                frame_feature = frame_feature.flatten(0, 1)
                            elif mm_newline_position == 'frame2':
                                # Frame-wise
                                raise NotImplementedError
                            elif mm_newline_position == 'nothing':
                                frame_feature = frame_feature.flatten(0, 1)
                            else:
                                raise NotImplementedError("add pad please!!")
                        else:
                            frame_feature = frame_feature.flatten(0, 1)

                        # rank0_print(f"final video frame_feature.shape: {frame_feature.shape}")
                        image_feature = frame_feature
                    elif image_feature.shape[0] > 1:  # multi patches and multi images operations
                        # rank0_print("Single-images")
                        # NOTE: multi-image inputs never actually take this path; they are split into multiple padded single images
                        base_image_feature = image_feature[0]
                        image_feature = image_feature[1:]
                        origin_size = image_feature.shape
                        height = width = self.get_model().mm_projector.num_image_patches_per_side  # NOTE: hard-coded, 49 tokens per image
                        assert height * width == base_image_feature.shape[0], f"height:{height}, width: {width}, base_image_feature: {base_image_feature.shape}"

                        if "anyres_max" in image_aspect_ratio:
                            matched_anyres_max_num_patches = re.match(r"anyres_max_(\d+)", image_aspect_ratio)
                            if matched_anyres_max_num_patches:
                                max_num_patches = int(matched_anyres_max_num_patches.group(1))

                        if "anyres" in image_aspect_ratio:
                            if hasattr(self.get_vision_tower(), "image_size"):
                                vision_tower_image_size = self.get_vision_tower().image_size
                            else:
                                raise ValueError("vision_tower_image_size is not found in the vision tower.")
                            try:
                                num_patch_width, num_patch_height = get_anyres_image_grid_shape(image_sizes[image_idx], self.config.image_grid_pinpoints, vision_tower_image_size, max_resolutions=None)
                            except Exception as e:
                                rank0_print(f"Error: {e}")
                                raise e
                            # num_patch_width, num_patch_height = 2, 2
                            image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
                        else:
                            raise NotImplementedError(image_aspect_ratio)
                            image_feature = image_feature.view(2, 2, height, width, -1)

                        if "maxpool2x2" in mm_patch_merge_type:
                            raise NotImplementedError
                            image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
                            image_feature = image_feature.flatten(1, 2).flatten(2, 3)
                            image_feature = nn.functional.max_pool2d(image_feature, 2)
                            image_feature = image_feature.flatten(1, 2).transpose(0, 1)
                        elif "unpad" in mm_patch_merge_type and "anyres_max" in image_aspect_ratio and matched_anyres_max_num_patches:
                            raise NotImplementedError
                        elif "unpad" in mm_patch_merge_type:
                            raise NotImplementedError
                        else:
                            image_feature = image_feature.permute(0, 2, 1, 3, 4).contiguous()
                            image_feature = image_feature.flatten(0, 3)

                        if "nobase" in mm_patch_merge_type:
                            pass
                        else:
                            try:
                                image_feature = torch.cat((base_image_feature, image_feature), dim=0)
                            except Exception as e:
                                raise ValueError(f"{num_patch_width} {num_patch_height} now: base_image_feature: {base_image_feature.shape}, {image_feature.shape}, image_sizes[image_idx]: {image_sizes[image_idx]}, origin_size: {origin_size}, {image_sizes[image_idx]}, {self.config.image_grid_pinpoints}, {vision_tower_image_size}")
                    else:  # single image operations
                        image_feature = image_feature[0]
                        if "unpad" in mm_patch_merge_type:
                            image_feature = torch.cat((image_feature, self.model.image_newline[None]), dim=0)

                    # rank0_print(f"image/video_feature.shape: {image_feature.shape}")
                    new_image_features.append(image_feature)
                image_features = new_image_features
            else:
                raise ValueError(f"Unexpected mm_patch_merge_type: {self.config.mm_patch_merge_type}")
        else:
            # raise NotImplementedError(f"images.shape={images.shape}, modalities={modalities}")
            image_features = self.encode_image(images)
        # TODO: image start / end is not implemented here to support pretraining.
        if getattr(self.config, "tune_mm_mlp_adapter", False) and getattr(self.config, "mm_use_im_start_end", False):
            raise NotImplementedError
        # rank0_print(f"Total images: {len(image_features)}")

        # Let's just add dummy tensors if they do not exist,
        # it is a headache to deal with None all the time.
        # But it is not ideal, and if you have a better idea,
        # please open an issue / submit a PR, thanks.
        _labels = labels
        _position_ids = position_ids
        _attention_mask = attention_mask
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
        else:
            attention_mask = attention_mask.bool()
        if position_ids is None:
            position_ids = torch.arange(0, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
        if labels is None:
            labels = torch.full_like(input_ids, IGNORE_INDEX)

        # remove the padding using attention_mask -- FIXME
        _input_ids = input_ids
        input_ids = [cur_input_ids[cur_attention_mask] for cur_input_ids, cur_attention_mask in zip(input_ids, attention_mask)]
        labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]

        new_input_embeds = []
        new_labels = []
        cur_image_idx = 0

        mm_llm_compress = getattr(self.config, "mm_llm_compress", False)
        if mm_llm_compress:
            self.model.llm_compress_type = getattr(self.config, "llm_compress_type", "attention")
            self.model.llm_compress_layer_list = getattr(self.config, "llm_compress_layer_list", [8, 16, 24])
            self.model.llm_image_token_ratio_list = getattr(self.config, "llm_image_token_ratio_list", [1.0, 0.5, 0.25, 0.125])
            first_image_token_position = []
            text_prompt_lens = []
        else:
            self.model.llm_compress_type = "attention"
            self.model.llm_compress_layer_list = []
            self.model.llm_image_token_ratio_list = []

        # rank_print("Inserting Images embedding")
        for batch_idx, cur_input_ids in enumerate(input_ids):
            num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()

            if mm_llm_compress:
                ####### copied from pdrop; NOTE only a single image/video is supported ##################
                # record the image position for further dropping
                image_index = torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist()
                assert len(image_index) == 1, f"Only a single image/video is supported: {image_index}"
                if image_index == []:
                    first_image_token_position.append(-1)
                else:
                    first_image_token_position.append(image_index[0])

                # record the input instruction length in inference mode
                if not self.training:
                    if image_index == []:
                        assert num_images == 0, num_images
                    else:
                        assert num_images == 1, f"num_images={num_images}, not supported"
                    text_prompt_lens.append(cur_input_ids.shape[0] - num_images)  # account for the image placeholder
                ###############################################

            # rank0_print(f"num_images={num_images}")
            if num_images == 0:
                cur_image_features = image_features[cur_image_idx]
                cur_input_embeds_1 = self.get_model().embed_tokens(cur_input_ids)
                cur_input_embeds = torch.cat([cur_input_embeds_1, cur_image_features[0:0]], dim=0)
                new_input_embeds.append(cur_input_embeds)
                new_labels.append(labels[batch_idx])
                cur_image_idx += 1
                continue

            image_token_indices = [-1] + torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [cur_input_ids.shape[0]]
            cur_input_ids_noim = []
            cur_labels = labels[batch_idx]
            cur_labels_noim = []
            for i in range(len(image_token_indices) - 1):
                cur_input_ids_noim.append(cur_input_ids[image_token_indices[i] + 1 : image_token_indices[i + 1]])
                cur_labels_noim.append(cur_labels[image_token_indices[i] + 1 : image_token_indices[i + 1]])
            split_sizes = [x.shape[0] for x in cur_labels_noim]
            cur_input_embeds = self.get_model().embed_tokens(torch.cat(cur_input_ids_noim))
            cur_input_embeds_no_im = torch.split(cur_input_embeds, split_sizes, dim=0)
            cur_new_input_embeds = []
            cur_new_labels = []

            for i in range(num_images + 1):
                cur_new_input_embeds.append(cur_input_embeds_no_im[i])
                cur_new_labels.append(cur_labels_noim[i])
                if i < num_images:
                    try:
                        cur_image_features = image_features[cur_image_idx]
                    except IndexError:
                        rank0_print(f"cur_image_idx={cur_image_idx} is not ok")
                        cur_image_features = image_features[cur_image_idx - 1]
                    cur_image_idx += 1
                    cur_new_input_embeds.append(cur_image_features)
                    cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))

            cur_new_input_embeds = [x.to(self.device) for x in cur_new_input_embeds]
            # import pdb; pdb.set_trace()
            cur_new_input_embeds = torch.cat(cur_new_input_embeds)
            cur_new_labels = torch.cat(cur_new_labels)

            new_input_embeds.append(cur_new_input_embeds)
            new_labels.append(cur_new_labels)

        if mm_llm_compress:
            self.model.first_image_token_position = first_image_token_position
            self.model.text_prompt_lens = text_prompt_lens
            self.model.num_image_token_lens = [image_feature.shape[0] for image_feature in image_features]

        # Truncate sequences to max length as image embeddings can make the sequence longer
        tokenizer_model_max_length = getattr(self.config, "tokenizer_model_max_length", None)
        # rank_print("Finishing Inserting")
        new_input_embeds = [x[:tokenizer_model_max_length] for x, modality in zip(new_input_embeds, modalities)]
        new_labels = [x[:tokenizer_model_max_length] for x, modality in zip(new_labels, modalities)]

        # Combine them
        max_len = max(x.shape[0] for x in new_input_embeds)
        batch_size = len(new_input_embeds)

        new_input_embeds_padded = []
        new_labels_padded = torch.full((batch_size, max_len), IGNORE_INDEX, dtype=new_labels[0].dtype, device=new_labels[0].device)
        attention_mask = torch.zeros((batch_size, max_len), dtype=attention_mask.dtype, device=attention_mask.device)
        position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device)
        # rank0_print("Prepare pos id")

        for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
            cur_len = cur_new_embed.shape[0]
            if getattr(self.config, "tokenizer_padding_side", "right") == "left":
                new_input_embeds_padded.append(torch.cat((torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device), cur_new_embed), dim=0))
                if cur_len > 0:
                    new_labels_padded[i, -cur_len:] = cur_new_labels
                    attention_mask[i, -cur_len:] = True
                    position_ids[i, -cur_len:] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
            else:
                new_input_embeds_padded.append(torch.cat((cur_new_embed, torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)), dim=0))
                if cur_len > 0:
                    new_labels_padded[i, :cur_len] = cur_new_labels
                    attention_mask[i, :cur_len] = True
                    position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)

        new_input_embeds = torch.stack(new_input_embeds_padded, dim=0)
        # rank0_print("tokenizer padding")

        if _labels is None:
            new_labels = None
        else:
            new_labels = new_labels_padded

        if _attention_mask is None:
            attention_mask = None
        else:
            attention_mask = attention_mask.to(dtype=_attention_mask.dtype)

        if _position_ids is None:
            position_ids = None

        if getattr(self.config, "use_pos_skipping", False) and self.training:
            position_ids = torch.arange(new_input_embeds.size(1), device=new_input_embeds.device).unsqueeze(0).to(new_input_embeds.device)
            split_position = random.randint(0, new_input_embeds.size(1))
            left_add = random.randint(0, self.config.pos_skipping_range)
            right_add = random.randint(left_add, self.config.pos_skipping_range)
            position_ids[:, :split_position] += left_add
            position_ids[:, split_position:] += right_add
        # import pdb; pdb.set_trace()
        # print("Finish preparing")
        return None, position_ids, attention_mask, past_key_values, new_input_embeds, new_labels
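    # NOTE (editor): a hedged illustration of the position-skipping trick above.
    # With pos_skipping_range = 4096 and a sequence of length L, training-time
    # position ids become, e.g.:
    #
    #   split = random.randint(0, L)
    #   left = random.randint(0, 4096); right = random.randint(left, 4096)
    #   pos = torch.arange(L); pos[:split] += left; pos[split:] += right
    #
    # so the model sees position gaps during training, a common recipe for
    # stretching the usable context of RoPE models beyond the trained lengths.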
    def initialize_vision_tokenizer(self, model_args, tokenizer):
        if model_args.mm_use_im_patch_token:
            tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
            self.resize_token_embeddings(len(tokenizer))

        if model_args.mm_use_im_start_end:
            num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
            self.resize_token_embeddings(len(tokenizer))

            if num_new_tokens > 0:
                input_embeddings = self.get_input_embeddings().weight.data
                output_embeddings = self.get_output_embeddings().weight.data

                input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
                output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

                input_embeddings[-num_new_tokens:] = input_embeddings_avg
                output_embeddings[-num_new_tokens:] = output_embeddings_avg

            if model_args.tune_mm_mlp_adapter:
                for p in self.get_input_embeddings().parameters():
                    p.requires_grad = True
                for p in self.get_output_embeddings().parameters():
                    p.requires_grad = False

            if model_args.pretrain_mm_mlp_adapter:
                mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location="cpu")
                embed_tokens_weight = mm_projector_weights["model.embed_tokens.weight"]
                assert num_new_tokens == 2
                if input_embeddings.shape == embed_tokens_weight.shape:
                    input_embeddings[-num_new_tokens:] = embed_tokens_weight[-num_new_tokens:]
                elif embed_tokens_weight.shape[0] == num_new_tokens:
                    input_embeddings[-num_new_tokens:] = embed_tokens_weight
                else:
                    raise ValueError(f"Unexpected embed_tokens_weight shape. Pretrained: {embed_tokens_weight.shape}. Current: {input_embeddings.shape}. Number of new tokens: {num_new_tokens}.")
        elif model_args.mm_use_im_patch_token:
            if model_args.tune_mm_mlp_adapter:
                for p in self.get_input_embeddings().parameters():
                    p.requires_grad = False
                for p in self.get_output_embeddings().parameters():
                    p.requires_grad = False

================================================
FILE: llava-train_videochat/llava/model/make_delta.py
================================================
"""
Usage:
python3 -m llava.model.make_delta --base ~/model_weights/llama-7b --target ~/model_weights/llava-7b --delta ~/model_weights/llava-7b-delta --hub-repo-id liuhaotian/llava-7b-delta
"""

import argparse

import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from llava.model.utils import auto_upgrade


def make_delta(base_model_path, target_model_path, delta_path, hub_repo_id):
    print("Loading base model")
    base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)

    print("Loading target model")
    auto_upgrade(target_model_path)
    target = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)

    print("Calculating delta")
    for name, param in tqdm(target.state_dict().items(), desc="Calculating delta"):
        if name not in base.state_dict():
            assert name in ["model.mm_projector.weight", "model.mm_projector.bias"], f"{name} not in base model"
            continue
        if param.data.shape == base.state_dict()[name].shape:
            param.data -= base.state_dict()[name]
        else:
            assert name in ["model.embed_tokens.weight", "lm_head.weight"], f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
            bparam = base.state_dict()[name]
            param.data[: bparam.shape[0], : bparam.shape[1]] -= bparam

    print("Saving delta")
    if hub_repo_id:
        kwargs = {"push_to_hub": True, "repo_id": hub_repo_id}
    else:
        kwargs = {}
    target.save_pretrained(delta_path, **kwargs)
    target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)
    target_tokenizer.save_pretrained(delta_path, **kwargs)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--base-model-path", type=str, required=True)
    parser.add_argument("--target-model-path", type=str, required=True)
    parser.add_argument("--delta-path", type=str, required=True)
    parser.add_argument("--hub-repo-id", type=str, default=None)

    args = parser.parse_args()

    make_delta(args.base_model_path, args.target_model_path, args.delta_path, args.hub_repo_id)
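
# NOTE (editor): the delta convention above stores `target - base` per shared
# tensor (and, for size-mismatched embed_tokens/lm_head, subtracts the base into
# the top-left block), so weights are presumably recovered by adding the base
# back: w_target = w_base + w_delta (see apply_delta.py in this directory).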

================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/builder.py
================================================
import os
from .clip_encoder import CLIPVisionTower, CLIPVisionTowerS2
from .siglip_encoder import SigLipVisionTower
from .umt_encoder import UMTVisionTower
from .internvideo2_encoder import InternVideo2VisionTower
# from .eva_clip.eva_clip_encoder import EvaClipVisionTower
# from .dev_eva_clip.eva_vit import EvaViTWrapper


def build_vision_tower(vision_tower_cfg, **kwargs):
    vision_tower = getattr(vision_tower_cfg, "mm_vision_tower", getattr(vision_tower_cfg, "vision_tower", None))
    # is_absolute_path_exists = os.path.exists(vision_tower)  # NOTE: brittle path check, disabled
    use_s2 = getattr(vision_tower_cfg, "s2", False)
    if 'clip-vit' in vision_tower or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
        if use_s2:
            return CLIPVisionTowerS2(vision_tower, args=vision_tower_cfg, **kwargs)
        else:
            return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
    elif "siglip" in vision_tower:
        return SigLipVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, **kwargs)
    elif "internvideo2" in vision_tower:
        return InternVideo2VisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, image_size=224, **kwargs)
    elif "umt-hd" in vision_tower:
        return UMTVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, image_size=448, **kwargs)
    elif "umt" in vision_tower:
        return UMTVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, **kwargs)
    else:
        raise ValueError(f"Unknown vision tower: {vision_tower}")
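
# NOTE (editor): dispatch is purely substring-based, so branch order matters:
# "umt-hd" must be tested before "umt". A hedged illustration of the mapping:
#   "umt-hd-..."                     -> UMTVisionTower(image_size=448)
#   "internvideo2-..."               -> InternVideo2VisionTower(image_size=224)
#   "openai/clip-vit-large-patch14"  -> CLIPVisionTower (or CLIPVisionTowerS2 if s2)
# Unrecognized names raise ValueError rather than falling back silently.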

================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/clip_encoder.py
================================================
import torch
import torch.nn as nn

from llava.utils import rank0_print
from transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig

try:
    from s2wrapper import forward as multiscale_forward
except ImportError:
    pass


class CLIPVisionTower(nn.Module):
    def __init__(self, vision_tower, args, delay_load=False):
        super().__init__()

        self.is_loaded = False

        self.vision_tower_name = vision_tower
        self.select_layer = args.mm_vision_select_layer
        self.select_feature = getattr(args, "mm_vision_select_feature", "patch")

        if not delay_load:
            rank0_print(f"Loading vision tower: {vision_tower}")
            self.load_model()
        elif getattr(args, "unfreeze_mm_vision_tower", False):
            # TODO: a better detector is needed.
            rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.")
            self.load_model()
        elif hasattr(args, "mm_tunable_parts") and "mm_vision_tower" in args.mm_tunable_parts:
            rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.")
            self.load_model()
        else:
            self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)

    def load_model(self, device_map=None):
        if self.is_loaded:
            rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
            return

        self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
        self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
        self.vision_tower.requires_grad_(False)

        self.is_loaded = True

    def feature_select(self, image_forward_outs):
        select_feature_type = self.select_feature

        if self.select_feature in ["slicefour_patch", "slicefour_cls_patch"]:
            select_every_k_layer = len(image_forward_outs.hidden_states) // 4
            image_features = torch.cat([image_forward_outs.hidden_states[i] for i in range(select_every_k_layer + self.select_layer, len(image_forward_outs.hidden_states), select_every_k_layer)], dim=-1)
            select_feature_type = select_feature_type.replace("slicefour_", "")
        elif self.select_feature in ["slice_m25811_f6_patch", "slice_m25811_f6_cls_patch"]:
            select_layers = [-2, -5, -8, -11, 6]
            image_features = torch.cat([image_forward_outs.hidden_states[i] for i in select_layers], dim=-1)
            select_feature_type = select_feature_type.replace("slice_m25811_f6_", "")
        else:
            image_features = image_forward_outs.hidden_states[self.select_layer]

        if select_feature_type == "patch":
            image_features = image_features[:, 1:]
        elif select_feature_type == "cls_patch":
            image_features = image_features
        else:
            raise ValueError(f"Unexpected select feature: {select_feature_type}")
        return image_features

    def forward(self, images):
        if type(images) is list:
            image_features = []
            for image in images:
                image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True)
                image_feature = self.feature_select(image_forward_out).to(image.dtype)
                image_features.append(image_feature)
        else:
            image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
            image_features = self.feature_select(image_forward_outs).to(images.dtype)

        return image_features

    @property
    def dummy_feature(self):
        return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)

    @property
    def dtype(self):
        return self.vision_tower.dtype

    @property
    def device(self):
        return self.vision_tower.device

    @property
    def config(self):
        if self.is_loaded:
            return self.vision_tower.config
        else:
            return self.cfg_only

    @property
    def hidden_size(self):
        _hidden_size = self.config.hidden_size
        if "slicefour" in self.select_feature:
            _hidden_size *= 4
        if "slice_m25811_f6" in self.select_feature:
            _hidden_size *= 5
        return _hidden_size

    @property
    def num_patches_per_side(self):
        return self.config.image_size // self.config.patch_size

    @property
    def num_patches(self):
        _num_patches = (self.config.image_size // self.config.patch_size) ** 2
        if "cls_patch" in self.select_feature:
            _num_patches += 1
        return _num_patches

    @property
    def image_size(self):
        return self.config.image_size
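
# NOTE (editor): hidden_size above depends on select_feature: the "slicefour_*"
# variants concatenate 4 evenly spaced hidden layers (4 * hidden_size) and the
# "slice_m25811_f6_*" variants concatenate layers [-2, -5, -8, -11, 6]
# (5 * hidden_size); num_patches likewise grows by 1 when the CLS token is kept
# via a "cls_patch" variant.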
split_forward=True) return image_features @property def hidden_size(self): return self.config.hidden_size * len(self.s2_scales) ================================================ FILE: llava-train_videochat/llava/model/multimodal_encoder/internvideo2/__init__.py ================================================ from .vit_scale_clean import PretrainVisionTransformer_clean ================================================ FILE: llava-train_videochat/llava/model/multimodal_encoder/internvideo2/flash_attention_class.py ================================================ import torch import torch.nn as nn from einops import rearrange from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func from flash_attn.bert_padding import unpad_input, pad_input class FlashAttention(nn.Module): """Implement the scaled dot product attention with softmax. Arguments --------- softmax_scale: The temperature to use for the softmax attention. (default: 1/sqrt(d_keys) where d_keys is computed at runtime) attention_dropout: The dropout rate to apply to the attention (default: 0.0) """ def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None): super().__init__() self.softmax_scale = softmax_scale self.dropout_p = attention_dropout def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None, max_s=None, need_weights=False): """Implements the multihead softmax attention. Arguments --------- qkv: The tensor containing the query, key, and value. (B, S, 3, H, D) if key_padding_mask is None if unpadded: (nnz, 3, h, d) key_padding_mask: a bool tensor of shape (B, S) """ assert not need_weights assert qkv.dtype in [torch.float16, torch.bfloat16] assert qkv.is_cuda if cu_seqlens is None: batch_size = qkv.shape[0] seqlen = qkv.shape[1] if key_padding_mask is None: qkv = rearrange(qkv, 'b s ... -> (b s) ...') max_s = seqlen cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32, device=qkv.device) output = flash_attn_varlen_qkvpacked_func( qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0, softmax_scale=self.softmax_scale, causal=causal ) output = rearrange(output, '(b s) ... 
-> b s ...', b=batch_size) else: nheads = qkv.shape[-2] x = rearrange(qkv, 'b s three h d -> b s (three h d)') x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask) x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads) output_unpad = flash_attn_varlen_qkvpacked_func( x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0, softmax_scale=self.softmax_scale, causal=causal ) output = rearrange(pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'), indices, batch_size, seqlen), 'b s (h d) -> b s h d', h=nheads) else: assert max_s is not None output = flash_attn_varlen_qkvpacked_func( qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0, softmax_scale=self.softmax_scale, causal=causal ) return output, None ================================================ FILE: llava-train_videochat/llava/model/multimodal_encoder/internvideo2/pos_embed.py ================================================ import numpy as np import torch import logging logger = logging.getLogger(__name__) # -------------------------------------------------------- # 3D sine-cosine position embedding # References: # MVD: https://github.com/ruiwang2021/mvd/blob/main/modeling_finetune.py # -------------------------------------------------------- def get_3d_sincos_pos_embed(embed_dim, grid_size, t_size, cls_token=False): """ grid_size: int of the grid height and width t_size: int of the temporal size return: pos_embed: [t_size*grid_size*grid_size, embed_dim] or [1+t_size*grid_size*grid_size, embed_dim] (w/ or w/o cls_token) """ assert embed_dim % 4 == 0 embed_dim_spatial = embed_dim // 4 * 3 embed_dim_temporal = embed_dim // 4 # spatial grid_h = np.arange(grid_size, dtype=np.float32) grid_w = np.arange(grid_size, dtype=np.float32) grid = np.meshgrid(grid_w, grid_h) # here w goes first grid = np.stack(grid, axis=0) grid = grid.reshape([2, 1, grid_size, grid_size]) pos_embed_spatial = get_2d_sincos_pos_embed_from_grid( embed_dim_spatial, grid ) # temporal grid_t = np.arange(t_size, dtype=np.float32) pos_embed_temporal = get_1d_sincos_pos_embed_from_grid( embed_dim_temporal, grid_t ) # concate: [T, H, W] order pos_embed_temporal = pos_embed_temporal[:, np.newaxis, :] pos_embed_temporal = np.repeat( pos_embed_temporal, grid_size**2, axis=1 ) # [T, H*W, D // 4] pos_embed_spatial = pos_embed_spatial[np.newaxis, :, :] pos_embed_spatial = np.repeat( pos_embed_spatial, t_size, axis=0 ) # [T, H*W, D // 4 * 3] pos_embed = np.concatenate([pos_embed_temporal, pos_embed_spatial], axis=-1) pos_embed = pos_embed.reshape([-1, embed_dim]) # [T*H*W, D] if cls_token: pos_embed = np.concatenate( [np.zeros([1, embed_dim]), pos_embed], axis=0 ) return pos_embed # -------------------------------------------------------- # 2D sine-cosine position embedding # References: # Transformer: https://github.com/tensorflow/models/blob/master/official/nlp/transformer/model_utils.py # MoCo v3: https://github.com/facebookresearch/moco-v3 # -------------------------------------------------------- def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False): """ grid_size: int of the grid height and width return: pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token) """ grid_h = np.arange(grid_size, dtype=np.float32) grid_w = np.arange(grid_size, dtype=np.float32) grid = np.meshgrid(grid_w, grid_h) # here w goes first grid = np.stack(grid, axis=0) grid = grid.reshape([2, 1, grid_size, grid_size]) pos_embed = 
get_2d_sincos_pos_embed_from_grid(embed_dim, grid) if cls_token: pos_embed = np.concatenate( [np.zeros([1, embed_dim]), pos_embed], axis=0 ) return pos_embed def get_1d_sincos_pos_embed(embed_dim, t_size, cls_token=False): """ t_size: int of the temporal size return: pos_embed: [t_size, embed_dim] or [1+t_size, embed_dim] (w/ or w/o cls_token) """ grid_t = np.arange(t_size, dtype=np.float32) pos_embed = get_1d_sincos_pos_embed_from_grid(embed_dim, grid_t) if cls_token: pos_embed = np.concatenate( [np.zeros([1, embed_dim]), pos_embed], axis=0 ) return pos_embed def get_2d_sincos_pos_embed_from_grid(embed_dim, grid): assert embed_dim % 2 == 0 # use half of dimensions to encode grid_h emb_h = get_1d_sincos_pos_embed_from_grid( embed_dim // 2, grid[0] ) # (H*W, D/2) emb_w = get_1d_sincos_pos_embed_from_grid( embed_dim // 2, grid[1] ) # (H*W, D/2) emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D) return emb def get_1d_sincos_pos_embed_from_grid(embed_dim, pos): """ embed_dim: output dimension for each position pos: a list of positions to be encoded: size (M,) out: (M, D) """ assert embed_dim % 2 == 0 omega = np.arange(embed_dim // 2, dtype=np.float32) omega /= embed_dim / 2.0 omega = 1.0 / 10000**omega # (D/2,) pos = pos.reshape(-1) # (M,) out = np.einsum("m,d->md", pos, omega) # (M, D/2), outer product emb_sin = np.sin(out) # (M, D/2) emb_cos = np.cos(out) # (M, D/2) emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D) return emb def interpolate_pos_embed_internvideo2(checkpoint_model, model, orig_t_size = 8): # interpolate position embedding for pos_name in ['pos_embed', 'clip_pos_embed']: if pos_name in checkpoint_model: pos_embed_checkpoint = checkpoint_model[pos_name] embedding_size = pos_embed_checkpoint.shape[-1] # channel dim num_patches = model.patch_embed.num_patches # num_extra_tokens = model.pos_embed.shape[-2] - num_patches # 0/1 # we use 8 frames for pretraining # new_t_size = args.num_frames * args.num_segments // model.patch_embed.tubelet_size new_t_size = model.num_frames // model.tubelet_size # height (== width) for the checkpoint position embedding orig_size = int(((pos_embed_checkpoint.shape[-2] - num_extra_tokens)//(orig_t_size)) ** 0.5) # height (== width) for the new position embedding new_size = int((num_patches // (new_t_size))** 0.5) # class_token and dist_token are kept unchanged if orig_t_size != new_t_size: logger.info(f"Temporal interpolate from {orig_t_size} to {new_t_size} ({pos_name})") print(f"Temporal interpolate from {orig_t_size} to {new_t_size} ({pos_name})") extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens] # only the position tokens are interpolated pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:] # B, L, C -> B, T, HW, C -> BHW, C, T (B = 1) pos_tokens = pos_tokens.view(1, orig_t_size, -1, embedding_size) pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, embedding_size, orig_t_size) pos_tokens = torch.nn.functional.interpolate(pos_tokens, size=new_t_size, mode='linear') pos_tokens = pos_tokens.view(1, -1, embedding_size, new_t_size) pos_tokens = pos_tokens.permute(0, 3, 1, 2).reshape(1, -1, embedding_size) new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1) checkpoint_model[pos_name] = new_pos_embed pos_embed_checkpoint = new_pos_embed # class_token and dist_token are kept unchanged if orig_size != new_size: logger.info(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})") print(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})") 
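As an aside before the spatial resampling continues below: every builder above reduces to the same 1-D sin-cos rule. A minimal, self-contained sketch (toy sizes, NumPy only) of what `get_1d_sincos_pos_embed_from_grid` computes:

```python
import numpy as np

def sincos_1d(embed_dim, pos):
    # omega_k = 1 / 10000^(k / (D/2)): a geometric ladder of frequencies
    omega = 1.0 / 10000 ** (np.arange(embed_dim // 2, dtype=np.float32) / (embed_dim / 2.0))
    out = np.einsum("m,d->md", pos.reshape(-1), omega)          # (M, D/2) phase matrix
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)   # (M, D)

emb = sincos_1d(8, np.arange(4, dtype=np.float32))
assert emb.shape == (4, 8)
# position 0 -> the sin half is all zeros, the cos half all ones
assert np.allclose(emb[0], [0, 0, 0, 0, 1, 1, 1, 1])
```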
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens] # only the position tokens are interpolated pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:] # B, L, C -> BT, H, W, C -> BT, C, H, W pos_tokens = pos_tokens.reshape(-1, new_t_size, orig_size, orig_size, embedding_size) pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2) pos_tokens = torch.nn.functional.interpolate( pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False) # BT, C, H, W -> BT, H, W, C -> B, T, H, W, C pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, new_t_size, new_size, new_size, embedding_size) pos_tokens = pos_tokens.flatten(1, 3) # B, L, C new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1) checkpoint_model[pos_name] = new_pos_embed for pos_name in ['img_pos_embed']: if pos_name in checkpoint_model: pos_embed_checkpoint = checkpoint_model[pos_name] embedding_size = pos_embed_checkpoint.shape[-1] # channel dim num_patches = model.patch_embed.num_img_patches # num_extra_tokens = model.pos_embed.shape[-2] - model.patch_embed.num_patches # 0/1 # we use 8 frames for pretraining # new_t_size = args.num_frames * args.num_segments // model.patch_embed.tubelet_size # height (== width) for the checkpoint position embedding orig_size = int(((pos_embed_checkpoint.shape[-2] - num_extra_tokens)) ** 0.5) # height (== width) for the new position embedding new_size = int((num_patches)** 0.5) # class_token and dist_token are kept unchanged if orig_size != new_size: logger.info(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})") print(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})") extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens] # only the position tokens are interpolated pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:] # B, L, C -> B, H, W, C -> B, C, H, W pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2) pos_tokens = torch.nn.functional.interpolate( pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False) # BT, C, H, W -> BT, H, W, C -> B, T, H, W, C pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, 1, new_size, new_size, embedding_size) pos_tokens = pos_tokens.flatten(1, 3) # B, L, C new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1) checkpoint_model[pos_name] = new_pos_embed if 'pos_embed_spatial' in checkpoint_model or 'pos_embed_temporal' in checkpoint_model: raise NotImplementedError def interpolate_pos_embed_internvideo2_new(checkpoint_model, model, orig_t_size = 8): pos_names = [] for k in checkpoint_model.keys(): if ('pos_embed' in k or 'clip_pos_embed' in k) and 'img' not in k: pos_names.append(k) assert len(pos_names) > 0, checkpoint_model.keys() if 'pos_embed_spatial' in checkpoint_model.keys() or 'pos_embed_temporal' in checkpoint_model.keys(): raise NotImplementedError # interpolate position embedding for pos_name in pos_names: pos_embed_checkpoint = checkpoint_model[pos_name] embedding_size = pos_embed_checkpoint.shape[-1] # channel dim num_patches = model.patch_embed.num_patches # num_extra_tokens = model.pos_embed.shape[-2] - num_patches # 0/1 # we use 8 frames for pretraining # new_t_size = args.num_frames * args.num_segments // model.patch_embed.tubelet_size new_t_size = model.num_frames // model.tubelet_size # height (== width) for the checkpoint position embedding orig_size = int(((pos_embed_checkpoint.shape[-2] - num_extra_tokens)//(orig_t_size)) ** 
0.5) # height (== width) for the new position embedding new_size = int((num_patches // (new_t_size))** 0.5) # class_token and dist_token are kept unchanged if orig_t_size != new_t_size: logger.info(f"Temporal interpolate from {orig_t_size} to {new_t_size} ({pos_name})") extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens] # only the position tokens are interpolated pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:] # B, L, C -> B, T, HW, C -> BHW, C, T (B = 1) pos_tokens = pos_tokens.view(1, orig_t_size, -1, embedding_size) pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, embedding_size, orig_t_size) pos_tokens = torch.nn.functional.interpolate(pos_tokens, size=new_t_size, mode='linear') pos_tokens = pos_tokens.view(1, -1, embedding_size, new_t_size) pos_tokens = pos_tokens.permute(0, 3, 1, 2).reshape(1, -1, embedding_size) new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1) checkpoint_model[pos_name] = new_pos_embed pos_embed_checkpoint = new_pos_embed # class_token and dist_token are kept unchanged if orig_size != new_size: logger.info(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})") extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens] # only the position tokens are interpolated pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:] # B, L, C -> BT, H, W, C -> BT, C, H, W pos_tokens = pos_tokens.reshape(-1, new_t_size, orig_size, orig_size, embedding_size) pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2) pos_tokens = torch.nn.functional.interpolate( pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False) # BT, C, H, W -> BT, H, W, C -> B, T, H, W, C pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, new_t_size, new_size, new_size, embedding_size) pos_tokens = pos_tokens.flatten(1, 3) # B, L, C new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1) checkpoint_model[pos_name] = new_pos_embed ================================================ FILE: llava-train_videochat/llava/model/multimodal_encoder/internvideo2/vit_scale_clean.py ================================================ import math import logging import torch import torch.nn.functional as F from timm.models.layers import DropPath, to_2tuple, trunc_normal_ from torch import nn import torch.utils.checkpoint as checkpoint from functools import partial from einops import rearrange from .pos_embed import get_3d_sincos_pos_embed, get_2d_sincos_pos_embed, get_1d_sincos_pos_embed, interpolate_pos_embed_internvideo2 from .flash_attention_class import FlashAttention logger = logging.getLogger(__name__) class CrossAttention(nn.Module): def __init__( self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., attn_head_dim=None, out_dim=None): super().__init__() if out_dim is None: out_dim = dim self.num_heads = num_heads head_dim = dim // num_heads if attn_head_dim is not None: head_dim = attn_head_dim all_head_dim = head_dim * self.num_heads self.scale = qk_scale or head_dim ** -0.5 assert all_head_dim == dim self.q = nn.Linear(dim, all_head_dim, bias=False) self.k = nn.Linear(dim, all_head_dim, bias=False) self.v = nn.Linear(dim, all_head_dim, bias=False) if qkv_bias: self.q_bias = nn.Parameter(torch.zeros(all_head_dim)) self.k_bias = nn.Parameter(torch.zeros(all_head_dim)) self.v_bias = nn.Parameter(torch.zeros(all_head_dim)) else: self.q_bias = None self.k_bias = None self.v_bias = None self.attn_drop = nn.Dropout(attn_drop) self.proj = nn.Linear(all_head_dim, out_dim) 
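This `CrossAttention` is what `AttentionPoolingBlock` (defined a little further below) uses to pool a whole token sequence into one vector: the sequence mean becomes a single query that cross-attends over all tokens. A minimal sketch of that pattern with plain tensors, keys reused as values for brevity (the real module projects q, k, v separately):

```python
import torch

B, N, C, H = 2, 196, 64, 8                    # batch, tokens, channels, heads
x = torch.randn(B, N, C)
q = x.mean(1, keepdim=True)                   # (B, 1, C): the pooling query
q_ = q.view(B, 1, H, C // H).transpose(1, 2)  # (B, H, 1, d)
kv = x.view(B, N, H, C // H).transpose(1, 2)  # (B, H, N, d); k == v in this sketch
attn = ((q_ * (C // H) ** -0.5) @ kv.transpose(-2, -1)).softmax(-1)  # (B, H, 1, N)
pooled = (attn @ kv).transpose(1, 2).reshape(B, C)  # one vector per sample
assert pooled.shape == (2, 64)
```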
self.proj_drop = nn.Dropout(proj_drop) def forward(self, x, k=None, v=None): B, N, C = x.shape N_k = k.shape[1] N_v = v.shape[1] q_bias, k_bias, v_bias = None, None, None if self.q_bias is not None: q_bias = self.q_bias k_bias = self.k_bias v_bias = self.v_bias q = F.linear(input=x, weight=self.q.weight, bias=q_bias) q = q.reshape(B, N, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0) # (B, N_head, N_q, dim) k = F.linear(input=k, weight=self.k.weight, bias=k_bias) k = k.reshape(B, N_k, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0) v = F.linear(input=v, weight=self.v.weight, bias=v_bias) v = v.reshape(B, N_v, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0) q = q * self.scale attn = (q @ k.transpose(-2, -1)) # (B, N_head, N_q, N_k) attn = attn.softmax(dim=-1) attn = self.attn_drop(attn) x = (attn @ v).transpose(1, 2).reshape(B, N, -1) x = self.proj(x) x = self.proj_drop(x) return x class AttentiveBlock(nn.Module): def __init__(self, dim, num_heads, qkv_bias=False, qk_scale=None, drop=0., attn_drop=0., drop_path=0., norm_layer=nn.LayerNorm, attn_head_dim=None, out_dim=None): super().__init__() self.norm1_q = norm_layer(dim) self.norm1_k = norm_layer(dim) self.norm1_v = norm_layer(dim) self.cross_attn = CrossAttention( dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop, attn_head_dim=attn_head_dim, out_dim=out_dim) self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity() def forward(self, x_q, x_kv, pos_q, pos_k, bool_masked_pos, rel_pos_bias=None): x_q = self.norm1_q(x_q + pos_q) x_k = self.norm1_k(x_kv + pos_k) x_v = self.norm1_v(x_kv) x = self.cross_attn(x_q, k=x_k, v=x_v) return x class AttentionPoolingBlock(AttentiveBlock): def forward(self, x): x_q = x.mean(1, keepdim=True) x_kv, pos_q, pos_k = x, 0, 0 x = super().forward(x_q, x_kv, pos_q, pos_k, bool_masked_pos=None, rel_pos_bias=None) x = x.squeeze(1) return x class RMSNorm(nn.Module): def __init__(self, hidden_size, eps=1e-6): super().__init__() self.weight = nn.Parameter(torch.ones(hidden_size)) self.variance_epsilon = eps def forward(self, hidden_states): input_dtype = hidden_states.dtype hidden_states = hidden_states.to(torch.float32) variance = hidden_states.pow(2).mean(-1, keepdim=True) hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) return self.weight * hidden_states.to(input_dtype) class LayerScale(nn.Module): def __init__(self, dim, init_values=1e-5, inplace=False, force_fp32=False): super().__init__() self.inplace = inplace self.weight = nn.Parameter(init_values * torch.ones(dim)) self.force_fp32 = force_fp32 @torch.cuda.amp.autocast(enabled=False) def forward(self, x): if self.force_fp32: output_type = x.dtype out = x.float().mul_(self.weight.float()) if self.inplace else x.float() * self.weight.float() return out.to(dtype=output_type) else: out = x.mul_(self.weight) if self.inplace else x * self.weight return out class Attention(nn.Module): def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0., use_flash_attn=False, causal=False, norm_layer=nn.LayerNorm, qk_normalization=False, use_fused_rmsnorm=False): super().__init__() assert dim % num_heads == 0, 'dim should be divisible by num_heads' self.num_heads = num_heads head_dim = dim // num_heads self.scale = head_dim ** -0.5 self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias) self.attn_drop = nn.Dropout(attn_drop) self.proj = nn.Linear(dim, dim) self.proj_drop = nn.Dropout(proj_drop) self.use_flash_attn = use_flash_attn if 
use_flash_attn: self.causal = causal self.inner_attn = FlashAttention(attention_dropout=attn_drop) self.qk_normalization = qk_normalization self.q_norm = norm_layer(dim) if qk_normalization else nn.Identity() self.k_norm = norm_layer(dim) if qk_normalization else nn.Identity() self.use_fused_rmsnorm = use_fused_rmsnorm def _naive_attn(self, x): B, N, C = x.shape # print(x.shape, torch.cuda.memory_allocated(), torch.cuda.memory_allocated()) qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4) q, k, v = qkv.unbind(0) # make torchscript happy (cannot use tensor as tuple) if self.qk_normalization: B_, H_, N_, D_ = q.shape q = self.q_norm(q.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2) k = self.k_norm(k.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2) attn = ((q * self.scale) @ k.transpose(-2, -1)) # attn = attn - attn.max(-1)[0].unsqueeze(-1) # in case of overflow for fp16 attn = attn.softmax(dim=-1) attn = self.attn_drop(attn) # print(torch.cuda.memory_allocated(), torch.cuda.memory_allocated()) x = (attn @ v).transpose(1, 2).reshape(B, N, C) x = self.proj(x) x = self.proj_drop(x) return x def _flash_attn(self, x, key_padding_mask=None, need_weights=False): qkv = self.qkv(x) qkv = rearrange(qkv, "b s (three h d) -> b s three h d", three=3, h=self.num_heads) if self.qk_normalization: q, k, v = qkv.unbind(2) if self.use_fused_rmsnorm: q = self.q_norm(q.flatten(-2, -1))[0].view(q.shape) k = self.k_norm(k.flatten(-2, -1))[0].view(k.shape) else: q = self.q_norm(q.flatten(-2, -1)).view(q.shape) k = self.k_norm(k.flatten(-2, -1)).view(k.shape) qkv = torch.stack([q, k, v], dim=2) context, _ = self.inner_attn( qkv, key_padding_mask=key_padding_mask, need_weights=need_weights, causal=self.causal ) outs = self.proj(rearrange(context, "b s h d -> b s (h d)")) outs = self.proj_drop(outs) return outs def forward(self, x): x = self._naive_attn(x) if not self.use_flash_attn else self._flash_attn(x) return x class Mlp(nn.Module): """ MLP as used in Vision Transformer, MLP-Mixer and related networks """ def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, bias=True, drop=0.): super().__init__() out_features = out_features or in_features hidden_features = hidden_features or in_features bias = to_2tuple(bias) drop_probs = to_2tuple(drop) self.fc1 = nn.Linear(in_features, hidden_features, bias=bias[0]) self.act = act_layer() self.drop1 = nn.Dropout(drop_probs[0]) self.fc2 = nn.Linear(hidden_features, out_features, bias=bias[1]) self.drop2 = nn.Dropout(drop_probs[1]) def forward(self, x): x = self.fc1(x) x = self.act(x) x = self.drop1(x) x = self.fc2(x) x = self.drop2(x) return x class Block(nn.Module): def __init__( self, dim, num_heads, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0., init_values=None, drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, use_flash_attn=False, use_fused_mlp=False, fused_mlp_heuristic=1, with_cp=False, qk_normalization=False, layerscale_no_force_fp32=False, use_fused_rmsnorm=False): super().__init__() self.norm1 = norm_layer(dim) self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop, use_flash_attn=use_flash_attn, causal=False, norm_layer=norm_layer, qk_normalization=qk_normalization, use_fused_rmsnorm=use_fused_rmsnorm) self.ls1 = LayerScale(dim, init_values=init_values, force_fp32=(not layerscale_no_force_fp32)) if init_values else nn.Identity() # NOTE: drop path for stochastic depth, we shall see 
if this is better than dropout here self.drop_path1 = DropPath(drop_path) if drop_path > 0. else nn.Identity() self.norm2 = norm_layer(dim) mlp_hidden_dim = int(dim * mlp_ratio) if use_fused_mlp: raise NotImplementedError self.mlp = FusedMLP(in_features=dim, hidden_features=mlp_hidden_dim, heuristic=fused_mlp_heuristic) else: self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop) self.ls2 = LayerScale(dim, init_values=init_values, force_fp32=(not layerscale_no_force_fp32)) if init_values else nn.Identity() self.drop_path2 = DropPath(drop_path) if drop_path > 0. else nn.Identity() self.with_cp = with_cp self.use_fused_rmsnorm = use_fused_rmsnorm def forward(self, x, residual=None): def _inner_forward(x, residual=None): if self.use_fused_rmsnorm: x, residual = self.norm1(x, residual) x = self.drop_path1(self.ls1(self.attn(x))) x, residual = self.norm2(x, residual) x = self.drop_path2(self.ls2(self.mlp(x))) return x, residual else: assert residual is None x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x)))) x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x)))) return x if self.with_cp: # print(f"\033[31m use_checkpoint [0m") return checkpoint.checkpoint(_inner_forward, x, residual) else: return _inner_forward(x, residual=residual) class PatchEmbed(nn.Module): """ 3D Image to Patch Embedding """ def __init__( self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, num_frames=8, tubelet_size=1, norm_layer=None ): super().__init__() img_size = to_2tuple(img_size) patch_size = to_2tuple(patch_size) self.img_size = img_size self.patch_size = patch_size self.grid_size = ( num_frames // tubelet_size, img_size[0] // patch_size[0], img_size[1] // patch_size[1] ) # (T, H, W) self.num_patches = self.grid_size[0] * self.grid_size[1] * self.grid_size[2] self.num_img_patches = self.grid_size[1] * self.grid_size[2] self.proj = nn.Conv3d( in_channels=in_chans, out_channels=embed_dim, kernel_size=(tubelet_size, patch_size[0], patch_size[1]), stride=(tubelet_size, patch_size[0], patch_size[1]) ) self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity() def forward(self, x): x = self.proj(x) x = x.flatten(3).permute(0, 2, 3, 1) # B x C x T x HW => B x T x HW x C x = self.norm(x) return x class PretrainVisionTransformer_clean(nn.Module): def __init__( self, in_chans: int = 3, patch_size: int = 14, img_size: int = 224, qkv_bias: bool = False, # follow internvl_clip to set False drop_path_rate: float = 0.25, # may need ablation embed_dim: int = 1408, num_heads: int = 16, mlp_ratio: float = 48/11, init_values: float = 1e-5, # may need ablation qk_normalization: bool = True, depth: int = 40, use_flash_attn: bool = True, use_fused_rmsnorm: bool = True, use_fused_mlp: bool = True, fused_mlp_heuristic: int = 1, attn_pool_num_heads: int = 16, clip_embed_dim: int = 768, layerscale_no_force_fp32: bool = False, # whether True for training? 
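# video-specific arguments below: num_frames // tubelet_size sets PatchEmbed's temporal grid, and sep_image_video_pos_embed allocates a separate positional table for single-frame (image) inputs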
num_frames: int = 8, tubelet_size: int = 1, sep_pos_embed: bool = False, sep_image_video_pos_embed: bool = False, use_checkpoint: bool = False, checkpoint_num: int = 0, # for unmasked teacher x_vis_return_idx=-1, x_vis_only=False ): super().__init__() self.num_frames = num_frames self.tubelet_size = tubelet_size # assert use_flash_attn == use_fused_rmsnorm == use_fused_mlp, f'use_flash_attn:{use_flash_attn}, use_fused_rmsnorm{use_fused_rmsnorm} and use_fused_mlp{use_fused_mlp} should be consistent' self.use_flash_attn = use_flash_attn self.embed_dim = embed_dim logger.info(f"Original depth: {depth}") depth = depth + x_vis_return_idx + 1 logger.info(f"New depth: {depth}") self.depth = depth self.x_vis_only = x_vis_only if use_fused_rmsnorm: raise NotImplementedError norm_layer_for_blocks = partial(DropoutAddRMSNorm, eps=1e-6, prenorm=True) else: norm_layer_for_blocks = partial(RMSNorm, eps=1e-6) self.norm_layer_for_blocks = norm_layer_for_blocks self.patch_embed = PatchEmbed( img_size, patch_size, in_chans, embed_dim, num_frames=num_frames, tubelet_size=tubelet_size, ) num_patches = self.patch_embed.num_patches num_img_patches = self.patch_embed.num_img_patches # print(f"num_patches: {num_patches}, num_img_patches: {num_img_patches}") self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) # stolen from https://github.com/facebookresearch/mae_st/blob/dc072aaaf640d06892e23a33b42223a994efe272/models_vit.py#L65-L73C17 self.sep_pos_embed = sep_pos_embed self.sep_image_video_pos_embed = sep_image_video_pos_embed if sep_pos_embed: raise NotImplementedError else: if sep_image_video_pos_embed: logger.info("Using separate position embeddings: image and video get different pos_embed.") self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim)) self.img_pos_embed = nn.Parameter(torch.zeros(1, num_img_patches + 1, embed_dim)) else: logger.info("Using a joint position embedding: image and video share the same pos_embed.") self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim)) dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # choose which layers use checkpointing with_cp_list = [False] * depth if use_checkpoint: for idx in range(depth): if idx < checkpoint_num: with_cp_list[idx] = True logger.info(f"Drop path rates: {dpr}") logger.info(f"Checkpoint list: {with_cp_list}") self.blocks = nn.ModuleList([ Block(embed_dim, num_heads, mlp_ratio, qkv_bias=qkv_bias, norm_layer=norm_layer_for_blocks, drop_path=dpr[i], init_values=init_values, attn_drop=0., use_flash_attn=use_flash_attn, use_fused_mlp=use_fused_mlp, fused_mlp_heuristic=fused_mlp_heuristic, with_cp=with_cp_list[i], qk_normalization=qk_normalization, layerscale_no_force_fp32=layerscale_no_force_fp32, use_fused_rmsnorm=use_fused_rmsnorm) for i in range(depth)]) if not self.x_vis_only: self.clip_projector = AttentionPoolingBlock( dim=embed_dim, num_heads=attn_pool_num_heads, qkv_bias=True, qk_scale=None, drop=0., attn_drop=0., norm_layer=partial(nn.LayerNorm, eps=1e-5), out_dim=clip_embed_dim) self.init_pos_embed() trunc_normal_(self.cls_token, std=.02) # NOTE: irrelevant for chat models, since the pretrained weights are always loaded self.apply(self._init_weights) self.fix_init_weight() def init_pos_embed(self): logger.info("Init pos_embed from sincos pos_embed") if self.sep_pos_embed: raise NotImplementedError else: pos_embed = get_3d_sincos_pos_embed( self.pos_embed.shape[-1], self.patch_embed.grid_size[1], # height & width self.patch_embed.grid_size[0], # t_size cls_token=True ) self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
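A quick shape check for the space-time table that `init_pos_embed` just filled; a sketch assuming `llava-train_videochat` is on `PYTHONPATH`, with toy sizes rather than the production 1408-dim config:

```python
import torch
from llava.model.multimodal_encoder.internvideo2.pos_embed import get_3d_sincos_pos_embed

D, S, T = 64, 16, 8                        # embed dim, grid side, temporal size
pe = get_3d_sincos_pos_embed(D, S, T, cls_token=True)
assert pe.shape == (1 + T * S * S, D)      # one row per (t, h, w) plus the cls slot
# channel split: the first D // 4 dims are temporal, the remaining 3 * D // 4 spatial
pos_embed = torch.from_numpy(pe).float().unsqueeze(0)  # (1, L + 1, D), as copied above
```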
if self.sep_image_video_pos_embed: img_pos_embed = get_3d_sincos_pos_embed( self.pos_embed.shape[-1], self.patch_embed.grid_size[1], # height & width 1, cls_token=True ) self.img_pos_embed.data.copy_(torch.from_numpy(img_pos_embed).float().unsqueeze(0)) def _init_weights(self, m): if isinstance(m, nn.Linear): trunc_normal_(m.weight, std=.02) if isinstance(m, nn.Linear) and m.bias is not None: nn.init.constant_(m.bias, 0) elif isinstance(m, nn.LayerNorm): nn.init.constant_(m.bias, 0) nn.init.constant_(m.weight, 1.0) def fix_init_weight(self): def rescale(param, layer_id): param.div_(math.sqrt(2.0 * layer_id)) for layer_id, layer in enumerate(self.blocks): rescale(layer.attn.proj.weight.data, layer_id + 1) rescale(layer.mlp.fc2.weight.data, layer_id + 1) @property def dtype(self): return self.patch_embed.proj.weight.dtype def get_num_layers(self): return len(self.blocks) @torch.jit.ignore def no_weight_decay(self): return { 'pos_embed', 'pos_embed_spatial', 'pos_embed_temporal', 'pos_embed_cls', 'img_pos_embed', 'cls_token' } # @torch.cuda.amp.autocast(enabled=False) def forward(self, x, mask=None, use_image=False): x = self.patch_embed(x.type(self.dtype)) # print(f"x.shape: {x.shape} x.dtype: {x.dtype}, model.dtype: {self.dtype}") B, T, L, C = x.shape # T: temporal; L: spatial x = x.view([B, T * L, C]) # append cls token cls_tokens = self.cls_token.expand(B, -1, -1) x = torch.cat((cls_tokens, x), dim=1) # add pos_embed if self.sep_pos_embed: raise NotImplementedError else: if use_image: if self.sep_image_video_pos_embed: pos_embed = self.img_pos_embed else: # (1, num_img_patches + 1, embed_dim) # print('origin pos_embed.shape:', self.pos_embed.shape) cls_pos_embed = self.pos_embed[:, 0:1, :] # print('cls_pos_embed.shape:', cls_pos_embed.shape) img_pos_embed = self.pos_embed[:, 1:, :].view(1, self.num_frames, self.patch_embed.num_patches // self.num_frames, self.embed_dim).mean(dim=1) # print('img_pos_embed.shape:', img_pos_embed.shape) pos_embed = torch.cat([cls_pos_embed, img_pos_embed], dim=1) # print('final img_pos_embed.shape:', pos_embed.shape) else: pos_embed = self.pos_embed # print("pos_embed.shape:", pos_embed.shape) x = x + pos_embed # mask tokens, ~mask means visible if mask is not None: x = x[~mask].reshape(B, -1, C) else: x = x.reshape(B, -1, C) residual = None for idx, blk in enumerate(self.blocks): if isinstance(x, tuple) and len(x) == 2: x, residual = x x = blk(x, residual=residual) if isinstance(x, tuple) and len(x) == 2: x, residual = x if residual is not None: x = x + residual x_vis = x if self.x_vis_only: return x_vis else: x_pool_vis = self.clip_projector(x_vis) return x_vis, x_pool_vis, None, None def pretrain_internvideo2_giant_patch14_224_clean(config): model = PretrainVisionTransformer_clean( in_chans=3, img_size=224, patch_size=14, embed_dim=1408, depth=40, num_heads=16, mlp_ratio=48/11, clip_embed_dim=config.vision_encoder.clip_embed_dim, attn_pool_num_heads=16, qkv_bias=False, drop_path_rate=0.25, init_values=0.00001, qk_normalization=True, use_flash_attn=config.vision_encoder.get('use_flash_attn', True), use_fused_rmsnorm=config.vision_encoder.get('use_fused_rmsnorm', True), use_fused_mlp=config.vision_encoder.get('use_fused_mlp', True), fused_mlp_heuristic=1, layerscale_no_force_fp32=False, num_frames=config.vision_encoder.num_frames, tubelet_size=config.vision_encoder.tubelet_size, sep_pos_embed=False, sep_image_video_pos_embed=config.vision_encoder.sep_image_video_pos_embed, use_checkpoint=config.vision_encoder.use_checkpoint,
checkpoint_num=config.vision_encoder.checkpoint_num, x_vis_return_idx=config.vision_encoder.x_vis_return_idx, x_vis_only=config.vision_encoder.x_vis_only, ) if config.vision_encoder.pretrained is not None: logger.info(f"Loading pretrained weights from {config.vision_encoder.pretrained}") state_dict = torch.load(config.vision_encoder.pretrained, map_location='cpu') interpolate_pos_embed_internvideo2(state_dict, model, orig_t_size=8) # NOTE 8f for stage1 message = model.load_state_dict(state_dict, strict=False) logger.info(message) else: logger.info("No pretrained weights!!!") return model def pretrain_internvideo2_6b_patch14_224_clean(config): model = PretrainVisionTransformer_clean( in_chans=3, img_size=224, patch_size=14, embed_dim=3200, depth=48, num_heads=25, mlp_ratio=4, clip_embed_dim=config.vision_encoder.clip_embed_dim, attn_pool_num_heads=16, qkv_bias=False, drop_path_rate=0.3, init_values=0.00001, qk_normalization=True, use_flash_attn=config.vision_encoder.get('use_flash_attn', True), use_fused_rmsnorm=config.vision_encoder.get('use_fused_rmsnorm', True), use_fused_mlp=config.vision_encoder.get('use_fused_mlp', True), fused_mlp_heuristic=1, layerscale_no_force_fp32=False, num_frames=config.vision_encoder.num_frames, tubelet_size=config.vision_encoder.tubelet_size, sep_pos_embed=False, sep_image_video_pos_embed=config.vision_encoder.sep_image_video_pos_embed, use_checkpoint=config.vision_encoder.use_checkpoint, checkpoint_num=config.vision_encoder.checkpoint_num, x_vis_return_idx=config.vision_encoder.x_vis_return_idx, x_vis_only=config.vision_encoder.x_vis_only ) if config.vision_encoder.pretrained is not None: logger.info(f"Loading pretrained weights from {config.vision_encoder.pretrained}") state_dict = torch.load(config.vision_encoder.pretrained, map_location='cpu') interpolate_pos_embed_internvideo2(state_dict, model, orig_t_size=8) # NOTE 8f for stage1 msg = model.load_state_dict(state_dict, strict=False) logger.info(msg) else: logger.info("No pretrained weights!!!") return model ================================================ FILE: llava-train_videochat/llava/model/multimodal_encoder/internvideo2_encoder.py ================================================ """ # Adapted from https://huggingface.co/MILVLG/imp-v1-3b/blob/main/vision_encoder.py """ from typing import Optional, Tuple, Union, Dict from dataclasses import dataclass from functools import partial, reduce from PIL import Image import torch import torch.utils.checkpoint from torch import nn import os from transformers.image_processing_utils import BatchFeature, get_size_dict from transformers.image_transforms import ( convert_to_rgb, normalize, rescale, resize, to_channel_dimension_format, ) from transformers.image_utils import ( ChannelDimension, PILImageResampling, to_numpy_array, ) from llava.utils import rank0_print from .internvideo2.vit_scale_clean import PretrainVisionTransformer_clean from .internvideo2.vit_scale_clean import interpolate_pos_embed_internvideo2 class InternVideo2ImageProcessor: def __init__(self, image_mean=(0.485, 0.456, 0.406), image_std=(0.229, 0.224, 0.225), size=(224, 224), crop_size: Dict[str, int] = None, resample=PILImageResampling.BICUBIC, rescale_factor=1 / 255, data_format=ChannelDimension.FIRST): crop_size = crop_size if crop_size is not None else {"height": size[0], "width": size[1]} crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size") self.image_mean = image_mean self.image_std = image_std self.size = size self.resample = resample 
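Once the remaining attributes and the `preprocess` method (just below) are in place, the processor can be exercised as in this hypothetical sketch: dummy PIL frames in, a stacked `pixel_values` tensor out. Exact pass-through behavior of `convert_to_rgb` on arrays may vary with the installed `transformers` version:

```python
import numpy as np
from PIL import Image

proc = InternVideo2ImageProcessor(size=(224, 224))
# four black frames standing in for decoded video
frames = [Image.fromarray(np.zeros((360, 640, 3), dtype=np.uint8)) for _ in range(4)]
out = proc.preprocess(frames, return_tensors="pt")
print(out["pixel_values"].shape)  # torch.Size([4, 3, 224, 224])
```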
self.rescale_factor = rescale_factor self.data_format = data_format self.crop_size = crop_size def preprocess(self, images, return_tensors, target_size=None): if isinstance(images, Image.Image): images = [images] else: # to adapt video data images = [to_numpy_array(image) for image in images] assert isinstance(images, list) if target_size is None: target_size = self.size transforms = [ convert_to_rgb, to_numpy_array, partial(resize, size=target_size, resample=self.resample, data_format=self.data_format), partial(rescale, scale=self.rescale_factor, data_format=self.data_format), partial(normalize, mean=self.image_mean, std=self.image_std, data_format=self.data_format), partial(to_channel_dimension_format, channel_dim=self.data_format, input_channel_dim=self.data_format), ] images = reduce(lambda x, f: [*map(f, x)], transforms, images) data = {"pixel_values": images} return BatchFeature(data=data, tensor_type=return_tensors) class InternVideo2VisionConfig: model_type = "internvideo2_vision_model" def __init__( self, num_frames=4, hidden_size=1408, num_hidden_layers=40, num_attention_heads=16, num_channels=3, image_size=224, patch_size=14, x_vis_return_idx=-2, sep_image_video_pos_embed=True, use_checkpoint=True, checkpoint_num=40, # **kwargs, ): # super().__init__(**kwargs) self.num_frames = num_frames self.hidden_size = hidden_size self.num_hidden_layers = num_hidden_layers self.num_attention_heads = num_attention_heads self.num_channels = num_channels self.patch_size = patch_size self.image_size = image_size self.x_vis_return_idx = x_vis_return_idx self.sep_image_video_pos_embed = sep_image_video_pos_embed self.use_checkpoint = use_checkpoint self.checkpoint_num = checkpoint_num def build_vit(config, pt_type='origin'): model = PretrainVisionTransformer_clean( in_chans=config.num_channels, img_size=config.image_size, patch_size=config.patch_size, embed_dim=config.hidden_size, depth=config.num_hidden_layers, num_heads=config.num_attention_heads, mlp_ratio=48/11, # clip_embed_dim=config.vision_encoder.clip_embed_dim, attn_pool_num_heads=16, qkv_bias=False, drop_path_rate=0.25, init_values=0.00001, qk_normalization=True, use_flash_attn=True, use_fused_rmsnorm=False, use_fused_mlp=False, fused_mlp_heuristic=1, layerscale_no_force_fp32=False, num_frames=config.num_frames, tubelet_size=1, sep_pos_embed=False, sep_image_video_pos_embed=config.sep_image_video_pos_embed, use_checkpoint=config.use_checkpoint, checkpoint_num=config.checkpoint_num, x_vis_return_idx=config.x_vis_return_idx, x_vis_only=True ) ckpt_path = "OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/InternVideo2-1B_f4_vision.pt" if not os.path.isfile(ckpt_path): raise NotImplementedError("Please download https://huggingface.co/OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/InternVideo2-1B_f4_vision.pt") state_dict = torch.load(ckpt_path, map_location='cpu') if config.num_frames != 4: raise NotImplementedError # make deepspeed zero3 happy if config.image_size != 224: interpolate_pos_embed_internvideo2(state_dict, model, orig_t_size=4) message = model.load_state_dict(state_dict, strict=False) rank0_print(message) return model class InternVideo2VisionTower(nn.Module): def __init__(self, vision_tower, vision_tower_cfg, delay_load=False, pt_type='origin', image_size=224): super().__init__() self.is_loaded = False self.pt_type = pt_type self.config = InternVideo2VisionConfig(num_frames=vision_tower_cfg.mm_local_num_frames, x_vis_return_idx=vision_tower_cfg.mm_vision_select_layer, image_size=image_size) self.vision_tower_name = 
vision_tower self.image_processor = InternVideo2ImageProcessor(size=(image_size, image_size)) if not delay_load: rank0_print(f"Loading vision tower: {vision_tower}") self.load_model() elif getattr(vision_tower_cfg, "unfreeze_mm_vision_tower", False): # TODO: better detector is needed. rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.") self.load_model() elif hasattr(vision_tower_cfg, "mm_tunable_parts") and "mm_vision_tower" in vision_tower_cfg.mm_tunable_parts: rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.") self.load_model() else: raise NotImplementedError self.cfg_only = self.config def load_model(self, device_map=None): if self.is_loaded: rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name)) return self.vision_tower = build_vit(self.config, pt_type=self.pt_type) self.vision_tower.requires_grad_(False) self.is_loaded = True def forward(self, images): if type(images) is list: raise NotImplementedError else: # input: B T C H W # output: B T*L C T = images.shape[1] images = images.permute(0, 2, 1, 3, 4) image_embeds = self.vision_tower(images, use_image=(T == 1)) return image_embeds[:, 1:, :] @property def dummy_feature(self): return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype) @property def dtype(self): for p in self.vision_tower.parameters(): return p.dtype @property def device(self): for p in self.vision_tower.parameters(): return p.device @property def hidden_size(self): return self.config.hidden_size @property def num_patches(self): return (self.config.image_size // self.config.patch_size) ** 2 @property def num_patches_per_side(self): return self.config.image_size // self.config.patch_size # return self.model_config["vision_cfg"]["image_size"] // self.model_config["vision_cfg"]["patch_size"] @property def image_size(self): return self.config.image_size ================================================ FILE: llava-train_videochat/llava/model/multimodal_encoder/siglip_encoder.py ================================================ """ # Adapted from https://huggingface.co/MILVLG/imp-v1-3b/blob/main/vision_encoder.py """ from typing import Optional, Tuple, Union, Dict from dataclasses import dataclass from functools import partial, reduce from PIL import Image import torch import torch.utils.checkpoint from torch import nn import os from transformers.image_processing_utils import BatchFeature, get_size_dict from transformers.image_transforms import ( convert_to_rgb, normalize, rescale, resize, to_channel_dimension_format, ) from transformers.image_utils import ( ChannelDimension, PILImageResampling, to_numpy_array, ) from transformers.activations import ACT2FN from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling from transformers.modeling_utils import PreTrainedModel from transformers import PretrainedConfig from transformers.utils import ModelOutput from llava.utils import rank0_print class SigLipImageProcessor: def __init__(self, image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5), size=(384, 384), crop_size: Dict[str, int] = None, resample=PILImageResampling.BICUBIC, rescale_factor=1 / 255, data_format=ChannelDimension.FIRST): crop_size = crop_size if crop_size is not None else {"height": 384, "width": 384} crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size") self.image_mean = image_mean self.image_std = image_std self.size 
= size self.resample = resample self.rescale_factor = rescale_factor self.data_format = data_format self.crop_size = crop_size def preprocess(self, images, return_tensors): if isinstance(images, Image.Image): images = [images] else: # to adapt video data images = [to_numpy_array(image) for image in images] assert isinstance(images, list) transforms = [ convert_to_rgb, to_numpy_array, partial(resize, size=self.size, resample=self.resample, data_format=self.data_format), partial(rescale, scale=self.rescale_factor, data_format=self.data_format), partial(normalize, mean=self.image_mean, std=self.image_std, data_format=self.data_format), partial(to_channel_dimension_format, channel_dim=self.data_format, input_channel_dim=self.data_format), ] images = reduce(lambda x, f: [*map(f, x)], transforms, images) data = {"pixel_values": images} return BatchFeature(data=data, tensor_type=return_tensors) class SigLipVisionConfig(PretrainedConfig): model_type = "siglip_vision_model" def __init__( self, hidden_size=1152, image_mean=(0.5, 0.5, 0.5), intermediate_size=4304, num_hidden_layers=27, num_attention_heads=16, num_channels=3, image_size=384, patch_size=14, hidden_act="gelu_pytorch_tanh", layer_norm_eps=1e-6, attention_dropout=0.0, **kwargs, ): super().__init__(**kwargs) self.hidden_size = hidden_size self.intermediate_size = intermediate_size self.num_hidden_layers = num_hidden_layers self.num_attention_heads = num_attention_heads self.num_channels = num_channels self.patch_size = patch_size self.image_size = image_size self.attention_dropout = attention_dropout self.layer_norm_eps = layer_norm_eps self.hidden_act = hidden_act self.image_mean = image_mean @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": cls._set_token_in_kwargs(kwargs) config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the vision config dict if we are loading from SigLipConfig if config_dict.get("model_type") == "siglip": config_dict = config_dict["vision_config"] if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: print(f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " f"{cls.model_type}. This is not supported for all configurations of models and can yield errors.") return cls.from_dict(config_dict, **kwargs) @dataclass # Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->SigLip class SigLipVisionModelOutput(ModelOutput): """ Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states. Args: image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): The image embeddings obtained by applying the projection layer to the pooler_output. last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): Sequence of hidden-states at the output of the last layer of the model. hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. """ image_embeds: Optional[torch.FloatTensor] = None last_hidden_state: torch.FloatTensor = None hidden_states: Optional[Tuple[torch.FloatTensor]] = None attentions: Optional[Tuple[torch.FloatTensor]] = None class SigLipVisionEmbeddings(nn.Module): def __init__(self, config: SigLipVisionConfig): super().__init__() self.config = config self.embed_dim = config.hidden_size self.image_size = config.image_size self.patch_size = config.patch_size self.patch_embedding = nn.Conv2d( in_channels=config.num_channels, out_channels=self.embed_dim, kernel_size=self.patch_size, stride=self.patch_size, padding="valid", ) self.num_patches = (self.image_size // self.patch_size) ** 2 self.num_positions = self.num_patches self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim) self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False) def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor: patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid] embeddings = patch_embeds.flatten(2).transpose(1, 2) embeddings = embeddings + self.position_embedding(self.position_ids) return embeddings class SigLipAttention(nn.Module): """Multi-headed attention from 'Attention Is All You Need' paper""" # Copied from transformers.models.clip.modeling_clip.CLIPAttention.__init__ def __init__(self, config): super().__init__() self.config = config self.embed_dim = config.hidden_size self.num_heads = config.num_attention_heads self.head_dim = self.embed_dim // self.num_heads if self.head_dim * self.num_heads != self.embed_dim: raise ValueError(f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:" f" {self.num_heads}).") self.scale = self.head_dim**-0.5 self.dropout = config.attention_dropout self.k_proj = nn.Linear(self.embed_dim, self.embed_dim) self.v_proj = nn.Linear(self.embed_dim, self.embed_dim) self.q_proj = nn.Linear(self.embed_dim, self.embed_dim) self.out_proj = nn.Linear(self.embed_dim, self.embed_dim) def forward( self, hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, output_attentions: Optional[bool] = False, ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: """Input shape: Batch x Time x Channel""" batch_size, q_len, _ = hidden_states.size() query_states = self.q_proj(hidden_states) key_states = self.k_proj(hidden_states) value_states = self.v_proj(hidden_states) query_states = query_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2) key_states = key_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2) value_states = value_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2) k_v_seq_len = key_states.shape[-2] attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scale if attn_weights.size() != (batch_size, self.num_heads, q_len, k_v_seq_len): raise ValueError(f"Attention weights should be of size {(batch_size, self.num_heads, q_len, k_v_seq_len)}, but is" f" {attn_weights.size()}") if attention_mask is not None: if attention_mask.size() != (batch_size, 
1, q_len, k_v_seq_len): raise ValueError(f"Attention mask should be of size {(batch_size, 1, q_len, k_v_seq_len)}, but is {attention_mask.size()}") attn_weights = attn_weights + attention_mask # upcast attention to fp32 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype) attn_weights = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training) attn_output = torch.matmul(attn_weights, value_states) if attn_output.size() != (batch_size, self.num_heads, q_len, self.head_dim): raise ValueError(f"`attn_output` should be of size {(batch_size, self.num_heads, q_len, self.head_dim)}, but is" f" {attn_output.size()}") attn_output = attn_output.transpose(1, 2).contiguous() attn_output = attn_output.reshape(batch_size, q_len, self.embed_dim) attn_output = self.out_proj(attn_output) return attn_output, attn_weights # Copied from transformers.models.clip.modeling_clip.CLIPMLP with CLIP->SigLip class SigLipMLP(nn.Module): def __init__(self, config): super().__init__() self.config = config self.activation_fn = ACT2FN[config.hidden_act] self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size) self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size) def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: hidden_states = self.fc1(hidden_states) hidden_states = self.activation_fn(hidden_states) hidden_states = self.fc2(hidden_states) return hidden_states # Copied from transformers.models.clip.modeling_clip.CLIPEncoderLayer with CLIP->SigLip class SigLipEncoderLayer(nn.Module): def __init__(self, config: SigLipVisionConfig): super().__init__() self.embed_dim = config.hidden_size self.self_attn = SigLipAttention(config) self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps) self.mlp = SigLipMLP(config) self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps) # Ignore copy def forward( self, hidden_states: torch.Tensor, attention_mask: torch.Tensor, output_attentions: Optional[bool] = False, ) -> Tuple[torch.FloatTensor]: """ Args: hidden_states (`torch.FloatTensor`): Input to the layer of shape `(batch, seq_len, embed_dim)`. attention_mask (`torch.FloatTensor`): Attention mask of shape `(batch, 1, q_len, k_v_seq_len)` where padding elements are indicated by very large negative values. output_attentions (`bool`, *optional*, defaults to `False`): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. """ residual = hidden_states hidden_states = self.layer_norm1(hidden_states) hidden_states, attn_weights = self.self_attn( hidden_states=hidden_states, attention_mask=attention_mask, output_attentions=output_attentions, ) hidden_states = residual + hidden_states residual = hidden_states hidden_states = self.layer_norm2(hidden_states) hidden_states = self.mlp(hidden_states) hidden_states = residual + hidden_states outputs = (hidden_states,) if output_attentions: outputs += (attn_weights,) return outputs class SigLipPreTrainedModel(PreTrainedModel): """ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models. 
""" config_class = SigLipVisionConfig base_model_prefix = "siglip" supports_gradient_checkpointing = True def _init_weights(self, module): """Initialize the weights""" pass # Copied from transformers.models.clip.modeling_clip.CLIPEncoder with CLIP->SigLip class SigLipEncoder(nn.Module): """ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a [`SigLipEncoderLayer`]. Args: config: SigLipVisionConfig """ def __init__(self, config: SigLipVisionConfig): super().__init__() self.config = config self.layers = nn.ModuleList([SigLipEncoderLayer(config) for _ in range(config.num_hidden_layers)]) self.gradient_checkpointing = False # Ignore copy def forward( self, inputs_embeds, attention_mask: Optional[torch.Tensor] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, ) -> Union[Tuple, BaseModelOutput]: r""" Args: inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix. attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: - 1 for tokens that are **not masked**, - 0 for tokens that are **masked**. [What are attention masks?](../glossary#attention-mask) output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. output_hidden_states (`bool`, *optional*): Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail. return_dict (`bool`, *optional*): Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
""" output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states return_dict = return_dict if return_dict is not None else self.config.use_return_dict encoder_states = () if output_hidden_states else None all_attentions = () if output_attentions else None hidden_states = inputs_embeds for encoder_layer in self.layers: if output_hidden_states: encoder_states = encoder_states + (hidden_states,) if self.gradient_checkpointing and self.training: layer_outputs = self._gradient_checkpointing_func( encoder_layer.__call__, hidden_states, attention_mask, output_attentions, ) else: layer_outputs = encoder_layer( hidden_states, attention_mask, output_attentions=output_attentions, ) hidden_states = layer_outputs[0] if output_attentions: all_attentions = all_attentions + (layer_outputs[1],) if output_hidden_states: encoder_states = encoder_states + (hidden_states,) if not return_dict: return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None) return BaseModelOutput(last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions) class SigLipVisionTransformer(nn.Module): def __init__(self, config: SigLipVisionConfig): super().__init__() self.config = config embed_dim = config.hidden_size self.embeddings = SigLipVisionEmbeddings(config) self.encoder = SigLipEncoder(config) self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) self.head = SigLipMultiheadAttentionPoolingHead(config) def forward( self, pixel_values, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, ) -> Union[Tuple, BaseModelOutputWithPooling]: r""" Returns: """ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states return_dict = return_dict if return_dict is not None else self.config.use_return_dict hidden_states = self.embeddings(pixel_values) encoder_outputs = self.encoder( inputs_embeds=hidden_states, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, ) last_hidden_state = encoder_outputs[0] last_hidden_state = self.post_layernorm(last_hidden_state) pooled_output = self.head(last_hidden_state) if not return_dict: return (last_hidden_state, pooled_output) + encoder_outputs[1:] return BaseModelOutputWithPooling( last_hidden_state=last_hidden_state, pooler_output=pooled_output, hidden_states=encoder_outputs.hidden_states, attentions=encoder_outputs.attentions, ) class SigLipMultiheadAttentionPoolingHead(nn.Module): """Multihead Attention Pooling.""" def __init__(self, config: SigLipVisionConfig): super().__init__() self.probe = nn.Parameter(torch.randn(1, 1, config.hidden_size)) self.attention = torch.nn.MultiheadAttention(config.hidden_size, config.num_attention_heads, batch_first=True) self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) self.mlp = SigLipMLP(config) def forward(self, hidden_state): batch_size = hidden_state.shape[0] probe = self.probe.repeat(batch_size, 1, 1) hidden_state = self.attention(probe, hidden_state, hidden_state)[0] residual = hidden_state hidden_state = self.layernorm(hidden_state) hidden_state = residual + self.mlp(hidden_state) return hidden_state[:, 0] class 
SigLipVisionModel(SigLipPreTrainedModel): config_class = SigLipVisionConfig main_input_name = "pixel_values" _no_split_modules = ["SigLipEncoderLayer"] def __init__(self, config: SigLipVisionConfig): super().__init__(config) self.vision_model = SigLipVisionTransformer(config) # Initialize weights and apply final processing self.post_init() def get_input_embeddings(self) -> nn.Module: return self.vision_model.embeddings.patch_embedding def forward( self, pixel_values, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, ) -> Union[Tuple, BaseModelOutputWithPooling]: r""" Returns: Examples: ```python >>> from PIL import Image >>> import requests >>> from transformers import AutoProcessor, SigLipVisionModel >>> model = SigLipVisionModel.from_pretrained("google/siglip-base-patch16-224") >>> processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224") >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" >>> image = Image.open(requests.get(url, stream=True).raw) >>> inputs = processor(images=image, return_tensors="pt") >>> outputs = model(**inputs) >>> last_hidden_state = outputs.last_hidden_state >>> pooled_output = outputs.pooler_output # pooled features ```""" return_dict = return_dict if return_dict is not None else self.config.use_return_dict return self.vision_model( pixel_values=pixel_values, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, ) class SigLipVisionTower(nn.Module): def __init__(self, vision_tower, vision_tower_cfg, delay_load=False): super().__init__() self.is_loaded = False self.config = SigLipVisionConfig() self.vision_tower_name = vision_tower self.image_processor = SigLipImageProcessor() if not delay_load: rank0_print(f"Loading vision tower: {vision_tower}") self.load_model() elif getattr(vision_tower_cfg, "unfreeze_mm_vision_tower", False): # TODO: better detector is needed. 
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.") self.load_model() elif hasattr(vision_tower_cfg, "mm_tunable_parts") and "mm_vision_tower" in vision_tower_cfg.mm_tunable_parts: rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.") self.load_model() else: self.cfg_only = self.config def load_model(self, device_map=None): if self.is_loaded: rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name)) return self.vision_tower = SigLipVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map) del self.vision_tower.vision_model.encoder.layers[-1:] self.vision_tower.vision_model.head = nn.Identity() self.vision_tower.requires_grad_(False) self.is_loaded = True def forward(self, images): if type(images) is list: image_features = [] for image in images: image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True) image_feature = image_forward_out.hidden_states[-1].to(image.dtype) assert image_features.shape[-2] == 729 image_features.append(image_feature) else: image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True) image_features = image_forward_outs.hidden_states[-1].to(images.dtype) assert image_features.shape[-2] == 729 return image_features @property def dummy_feature(self): return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype) @property def dtype(self): for p in self.vision_tower.parameters(): return p.dtype @property def device(self): for p in self.vision_tower.parameters(): return p.device @property def hidden_size(self): return self.config.hidden_size @property def num_patches(self): return (self.config.image_size // self.config.patch_size) ** 2 @property def num_patches_per_side(self): return self.config.image_size // self.config.patch_size # return self.model_config["vision_cfg"]["image_size"] // self.model_config["vision_cfg"]["patch_size"] @property def image_size(self): return self.config.image_size ================================================ FILE: llava-train_videochat/llava/model/multimodal_encoder/umt/vit.py ================================================ import numpy as np import torch import torch.nn as nn import torch.nn.functional as F import torch.utils.checkpoint as checkpoint from functools import partial try: from flash_attn import flash_attn_qkvpacked_func except: print("You need to install flash_attn") from timm.models.layers import drop_path, to_2tuple, trunc_normal_ # logger = logging.getLogger(__name__) class DropPath(nn.Module): """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks). 
""" def __init__(self, drop_prob=None): super(DropPath, self).__init__() self.drop_prob = drop_prob def forward(self, x): return drop_path(x, self.drop_prob, self.training) def extra_repr(self) -> str: return 'p={}'.format(self.drop_prob) class Mlp(nn.Module): def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.): super().__init__() out_features = out_features or in_features hidden_features = hidden_features or in_features self.fc1 = nn.Linear(in_features, hidden_features) self.act = act_layer() self.fc2 = nn.Linear(hidden_features, out_features) self.drop = nn.Dropout(drop) def forward(self, x): x = self.fc1(x) x = self.act(x) x = self.drop(x) x = self.fc2(x) x = self.drop(x) return x class Attention(nn.Module): def __init__( self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., attn_head_dim=None, attn_type='flash_v2'): super().__init__() self.num_heads = num_heads head_dim = dim // num_heads if attn_head_dim is not None: head_dim = attn_head_dim all_head_dim = head_dim * self.num_heads self.scale = qk_scale or head_dim ** -0.5 self.qkv = nn.Linear(dim, all_head_dim * 3, bias=False) if qkv_bias: self.q_bias = nn.Parameter(torch.zeros(all_head_dim)) self.v_bias = nn.Parameter(torch.zeros(all_head_dim)) else: self.q_bias = None self.v_bias = None if attn_type not in ['origin', 'flash_v2']: raise NotImplementedError(f"Not support attn_type: {attn_type}") print('umt:', f'attn_type: {attn_type}') self.attn_type = attn_type if attn_type == 'flash_v2': self.attn_drop = attn_drop else: self.attn_drop = nn.Dropout(attn_drop) self.proj = nn.Linear(all_head_dim, dim) self.proj_drop = nn.Dropout(proj_drop) def forward(self, x): B, N, C = x.shape qkv_bias = None if self.q_bias is not None: qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias)) # qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4) qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias) if self.attn_type == 'flash_v2': qkv = qkv.reshape(B, N, 3, self.num_heads, -1) x = flash_attn_qkvpacked_func(qkv, dropout_p=self.attn_drop, softmax_scale=self.scale, causal=False).reshape(B, N, -1) else: qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4) q, k, v = qkv[0], qkv[1], qkv[ 2] # make torchscript happy (cannot use tensor as tuple) # B num_heads N head_dim q = q * self.scale attn = (q @ k.transpose(-2, -1)) attn = attn.softmax(dim=-1) attn = self.attn_drop(attn) x = (attn @ v).transpose(1, 2).reshape(B, N, -1) x = self.proj(x) x = self.proj_drop(x) return x class Block(nn.Module): def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0., drop_path=0., init_values=None, act_layer=nn.GELU, norm_layer=nn.LayerNorm, attn_head_dim=None): super().__init__() self.norm1 = norm_layer(dim) self.attn = Attention( dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop, attn_head_dim=attn_head_dim) # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here self.drop_path = DropPath(drop_path) if drop_path > 0. 
else nn.Identity() self.norm2 = norm_layer(dim) mlp_hidden_dim = int(dim * mlp_ratio) self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop) if init_values > 0: self.gamma_1 = nn.Parameter(init_values * torch.ones((dim)),requires_grad=True) self.gamma_2 = nn.Parameter(init_values * torch.ones((dim)),requires_grad=True) else: self.gamma_1, self.gamma_2 = None, None def forward(self, x): if self.gamma_1 is None: x = x + self.drop_path(self.attn(self.norm1(x))) x = x + self.drop_path(self.mlp(self.norm2(x))) else: x = x + self.drop_path(self.gamma_1 * self.attn(self.norm1(x))) x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x))) return x class PatchEmbed(nn.Module): """ Image to Patch Embedding """ def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, num_frames=16, tubelet_size=2): super().__init__() img_size = to_2tuple(img_size) patch_size = to_2tuple(patch_size) self.tubelet_size = int(tubelet_size) num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0]) * (num_frames // self.tubelet_size) self.img_size = img_size self.patch_size = patch_size self.num_patches = num_patches self.proj = nn.Conv3d( in_channels=in_chans, out_channels=embed_dim, kernel_size=(self.tubelet_size, patch_size[0], patch_size[1]), stride=(self.tubelet_size, patch_size[0], patch_size[1]) ) print('umt:', f'Num of patches: {num_patches}') def forward(self, x, **kwargs): B, C, T, H, W = x.shape # FIXME look at relaxing size constraints # assert H == self.img_size[0] and W == self.img_size[1], \ # f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})." x = self.proj(x).flatten(2).transpose(1, 2) return x # sin-cos position encoding # https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py#L31 def get_sinusoid_encoding_table(n_position, d_hid, ckpt_num_frame=-1, cur_frame=12): ''' Sinusoid position encoding table ''' # TODO: make it with torch instead of numpy def get_position_angle_vec(position): return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)] if ckpt_num_frame != -1 and ckpt_num_frame != cur_frame: print('umt:', f"Interpolate position embedding") print('umt:', f"Testing frame: {cur_frame}") print('umt:', f"Checkpoint frame: {ckpt_num_frame}") T = ckpt_num_frame # checkpoint frame new_T = cur_frame # testing frame n_position = n_position // new_T * T # generate checkpoint position embedding sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)]) sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1 sinusoid_table = torch.tensor(sinusoid_table, dtype=torch.float, requires_grad=False).unsqueeze(0) # interpolate P = int((n_position // T) ** 0.5) C = d_hid sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C) sinusoid_table = sinusoid_table.permute(0, 2, 3, 4, 1).reshape(-1, C, T) # BHW, C, T sinusoid_table = torch.nn.functional.interpolate(sinusoid_table, size=new_T, mode='linear') sinusoid_table = sinusoid_table.reshape(1, P, P, C, new_T).permute(0, 4, 1, 2, 3) # B, T, H, W, C sinusoid_table = sinusoid_table.flatten(1, 3) return sinusoid_table else: sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)]) sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1 return 
torch.tensor(sinusoid_table, dtype=torch.float, requires_grad=False).unsqueeze(0) def get_sinusoid_encoding_table2(n_position=784, d_hid=1024, cur_frame=8, ckpt_num_frame=4, pre_n_position=784): ''' Sinusoid position encoding table ''' # TODO: make it with torch instead of numpy def get_position_angle_vec(position): return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)] # generate checkpoint position embedding sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(pre_n_position)]) sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1 sinusoid_table = torch.tensor(sinusoid_table, dtype=torch.float, requires_grad=False).unsqueeze(0) print(f"n_position: {n_position}") print(f"pre_n_position: {pre_n_position}") if n_position != pre_n_position: T = ckpt_num_frame # checkpoint frame P = 14 # checkpoint size C = d_hid new_P = int((n_position // cur_frame) ** 0.5) # testing size print(f'Pretraining uses 14x14, but current version is {new_P}x{new_P}') print(f'Interpolate the position embedding') sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C) sinusoid_table = sinusoid_table.reshape(-1, P, P, C).permute(0, 3, 1, 2) sinusoid_table = torch.nn.functional.interpolate( sinusoid_table, size=(new_P, new_P), mode='bicubic', align_corners=False) # BT, C, H, W -> BT, H, W, C -> B, T, H, W, C sinusoid_table = sinusoid_table.permute(0, 2, 3, 1).reshape(-1, T, new_P, new_P, C) sinusoid_table = sinusoid_table.flatten(1, 3) # B, THW, C if cur_frame != ckpt_num_frame: print(f'Pretraining uses 4 frames, but current frame is {cur_frame}') print(f'Interpolate the position embedding') T = ckpt_num_frame # checkpoint frame new_T = cur_frame # testing frame # interpolate P = int((n_position // cur_frame) ** 0.5) # testing size C = d_hid sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C) sinusoid_table = sinusoid_table.permute(0, 2, 3, 4, 1).reshape(-1, C, T) # BHW, C, T sinusoid_table = torch.nn.functional.interpolate(sinusoid_table, size=new_T, mode='linear') sinusoid_table = sinusoid_table.reshape(1, P, P, C, new_T).permute(0, 4, 1, 2, 3) # B, T, H, W, C sinusoid_table = sinusoid_table.flatten(1, 3) # B, THW, C return sinusoid_table class PretrainVisionTransformerEncoder(nn.Module): """ Vision Transformer with support for patch or hybrid CNN input stage """ def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0., drop_path_rate=0., norm_layer=nn.LayerNorm, init_values=None, num_frames=8, tubelet_size=1, use_learnable_pos_emb=False, use_checkpoint=False, checkpoint_num=0, ckpt_num_frame=-1, with_ln=True, return_index=-1 ): super().__init__() self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models self.patch_embed = PatchEmbed( img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim, num_frames=num_frames, tubelet_size=tubelet_size ) num_patches = self.patch_embed.num_patches self.depth = depth + return_index + 1 self.use_checkpoint = use_checkpoint self.checkpoint_num = checkpoint_num print('umt:', f"Use checkpoint: {use_checkpoint}") print('umt:', f"Checkpoint number: {checkpoint_num}") print('umt:', f"Real runing depth: {self.depth}") # TODO: Add the cls token if use_learnable_pos_emb: self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim)) self.img_pos_embed = 
nn.Parameter(torch.zeros(1, num_patches//(num_frames//tubelet_size) + 1, embed_dim)) else: # sine-cosine positional embeddings if img_size != 224: self.pos_embed = get_sinusoid_encoding_table2(num_patches, embed_dim, ckpt_num_frame=ckpt_num_frame, cur_frame=num_frames//tubelet_size) self.img_pos_embed = get_sinusoid_encoding_table2(num_patches//(num_frames//tubelet_size), embed_dim, cur_frame=1, ckpt_num_frame=1, pre_n_position=14*14) else: self.pos_embed = get_sinusoid_encoding_table(num_patches, embed_dim, ckpt_num_frame=ckpt_num_frame, cur_frame=num_frames//tubelet_size) self.img_pos_embed = get_sinusoid_encoding_table(num_patches//(num_frames//tubelet_size), embed_dim) dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule self.blocks = nn.ModuleList([ Block( dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale, drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer, init_values=init_values) for i in range(self.depth)]) if with_ln: self.vision_layernorm = nn.LayerNorm(embed_dim, eps=1e-12) else: self.vision_layernorm = nn.Identity() if use_learnable_pos_emb: trunc_normal_(self.pos_embed, std=.02) @torch.jit.ignore def no_weight_decay(self): return {'pos_embed', 'cls_token'} def forward_features(self, x, use_image=False): x = self.patch_embed(x) if use_image: x = x + self.img_pos_embed.type_as(x).to(x.device).clone().detach() else: x = x + self.pos_embed.type_as(x).to(x.device).clone().detach() B, _, C = x.shape x_vis = x for idx, blk in enumerate(self.blocks): if self.use_checkpoint and idx < self.checkpoint_num: x_vis = checkpoint.checkpoint(blk, x_vis) else: x_vis = blk(x_vis) # with ln ot not x_vis = self.vision_layernorm(x_vis) return x_vis def forward(self, x, use_image=False): x_vis = self.forward_features(x, use_image) return x_vis class PretrainVisionTransformer(nn.Module): """ Vision Transformer with support for patch or hybrid CNN input stage """ def __init__(self, img_size=224, patch_size=16, encoder_in_chans=3, encoder_embed_dim=768, encoder_depth=12, encoder_num_heads=12, mlp_ratio=4., qkv_bias=True, qk_scale=None, drop_rate=0., attn_drop_rate=0., drop_path_rate=0., norm_layer=partial(nn.LayerNorm, eps=1e-6), init_values=0., use_learnable_pos_emb=False, num_frames=8, tubelet_size=1, use_checkpoint=False, checkpoint_num=0, ckpt_num_frame=4, # the pretrained model uses 4 frames return_index=-1, with_ln=False ): super().__init__() self.encoder = PretrainVisionTransformerEncoder( img_size=img_size, patch_size=patch_size, in_chans=encoder_in_chans, embed_dim=encoder_embed_dim, depth=encoder_depth, num_heads=encoder_num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale, drop_rate=drop_rate, attn_drop_rate=attn_drop_rate, drop_path_rate=drop_path_rate, norm_layer=norm_layer, init_values=init_values, num_frames=num_frames, tubelet_size=tubelet_size, use_learnable_pos_emb=use_learnable_pos_emb, use_checkpoint=use_checkpoint, checkpoint_num=checkpoint_num, ckpt_num_frame=ckpt_num_frame, with_ln=with_ln, return_index=return_index ) print('umt:', f'With LN: {with_ln}') print('umt:', f'Total {encoder_depth} layer') print('umt:', f'Return {encoder_depth+return_index+1}-th layer') self.apply(self._init_weights) def _init_weights(self, m): if isinstance(m, nn.Linear): nn.init.xavier_uniform_(m.weight) if isinstance(m, nn.Linear) and m.bias is not None: nn.init.constant_(m.bias, 0) elif isinstance(m, nn.LayerNorm): nn.init.constant_(m.bias, 0) 
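            # Together with the bias reset above, this initializes every LayerNorm to an
            # identity transform; the pretrained checkpoint loaded in `build_vit` then
            # overwrites these initial values.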
nn.init.constant_(m.weight, 1.0) @torch.jit.ignore def no_weight_decay(self): return {'pos_embed', 'cls_token', 'clip_pos_embed'} def forward(self, x, use_image=False): T = x.shape[2] x_vis = self.encoder(x, use_image) # [B, N_vis, C_e] B, TL, C = x_vis.shape x_vis = x_vis.view(B, T, TL // T, C) return x_vis ================================================ FILE: llava-train_videochat/llava/model/multimodal_encoder/umt_encoder.py ================================================ """ # Adapted from https://huggingface.co/MILVLG/imp-v1-3b/blob/main/vision_encoder.py """ from typing import Optional, Tuple, Union, Dict from dataclasses import dataclass from functools import partial, reduce from PIL import Image import torch import torch.utils.checkpoint from torch import nn import os from transformers.image_processing_utils import BatchFeature, get_size_dict from transformers.image_transforms import ( convert_to_rgb, normalize, rescale, resize, to_channel_dimension_format, ) from transformers.image_utils import ( ChannelDimension, PILImageResampling, to_numpy_array, ) from llava.utils import rank0_print from .umt.vit import PretrainVisionTransformer class UMTImageProcessor: def __init__(self, image_mean=(0.485, 0.456, 0.406), image_std=(0.229, 0.224, 0.225), size=(224, 224), crop_size: Dict[str, int] = None, resample=PILImageResampling.BICUBIC, rescale_factor=1 / 255, data_format=ChannelDimension.FIRST): crop_size = crop_size if crop_size is not None else {"height": size[0], "width": size[1]} crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size") self.image_mean = image_mean self.image_std = image_std self.size = size self.resample = resample self.rescale_factor = rescale_factor self.data_format = data_format self.crop_size = crop_size def preprocess(self, images, return_tensors, target_size=None): if isinstance(images, Image.Image): images = [images] else: # to adapt video data images = [to_numpy_array(image) for image in images] assert isinstance(images, list) if target_size is None: target_size = self.size transforms = [ convert_to_rgb, to_numpy_array, partial(resize, size=target_size, resample=self.resample, data_format=self.data_format), partial(rescale, scale=self.rescale_factor, data_format=self.data_format), partial(normalize, mean=self.image_mean, std=self.image_std, data_format=self.data_format), partial(to_channel_dimension_format, channel_dim=self.data_format, input_channel_dim=self.data_format), ] images = reduce(lambda x, f: [*map(f, x)], transforms, images) data = {"pixel_values": images} return BatchFeature(data=data, tensor_type=return_tensors) class UMTVisionConfig: model_type = "umt_vision_model" def __init__( self, num_frames=4, hidden_size=1024, num_hidden_layers=24, num_attention_heads=16, num_channels=3, image_size=224, patch_size=16, return_idx=-2 # **kwargs, ): # super().__init__(**kwargs) self.num_frames = num_frames self.hidden_size = hidden_size self.num_hidden_layers = num_hidden_layers self.num_attention_heads = num_attention_heads self.num_channels = num_channels self.patch_size = patch_size self.image_size = image_size self.return_idx = return_idx def build_vit(config, pt_type='origin'): model = PretrainVisionTransformer( img_size=config.image_size, patch_size=16, encoder_embed_dim=1024, encoder_depth=24, encoder_num_heads=16, drop_path_rate=0., num_frames=config.num_frames, tubelet_size=1, use_checkpoint=True, checkpoint_num=24, return_index=config.return_idx, with_ln=True, # merge vision_layernorm in it ) ckpt_path = 
"OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/UMT-L_f4_vision.pt" if not os.path.isfile(ckpt_path): raise NotImplementedError("Please download https://huggingface.co/OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/UMT-L_f4_vision.pt") old_state_dict = torch.load(ckpt_path, map_location='cpu') state_dict = {} for k in old_state_dict: if k.startswith("encoder."): if k.startswith("encoder.norm"): state_dict[k.replace('encoder.norm', 'encoder.vision_layernorm')] = old_state_dict[k] else: state_dict[k] = old_state_dict[k] del old_state_dict msg = model.load_state_dict(state_dict, strict=False) print('umt:', f"Loading pretrained weights from {ckpt_path}", msg) return model class UMTVisionTower(nn.Module): def __init__(self, vision_tower, vision_tower_cfg, delay_load=False, pt_type='origin', image_size=224): super().__init__() self.is_loaded = False self.pt_type = pt_type self.config = UMTVisionConfig(num_frames=vision_tower_cfg.mm_local_num_frames, return_idx=vision_tower_cfg.mm_vision_select_layer, image_size=image_size) self.vision_tower_name = vision_tower self.image_processor = UMTImageProcessor(size=(image_size, image_size)) if not delay_load: rank0_print(f"Loading vision tower: {vision_tower}") self.load_model() elif getattr(vision_tower_cfg, "unfreeze_mm_vision_tower", False): # TODO: better detector is needed. rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.") self.load_model() elif hasattr(vision_tower_cfg, "mm_tunable_parts") and "mm_vision_tower" in vision_tower_cfg.mm_tunable_parts: rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.") self.load_model() else: self.cfg_only = self.config def load_model(self, device_map=None): if self.is_loaded: rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name)) return self.vision_tower = build_vit(self.config, pt_type=self.pt_type) self.vision_tower.requires_grad_(False) self.is_loaded = True def forward(self, images): if type(images) is list: raise NotImplementedError else: # input: B T C H W # output: B T*L C T = images.shape[1] images = images.permute(0, 2, 1, 3, 4) image_embeds = self.vision_tower(images, use_image=(T == 1)) B, T, L, C = image_embeds.shape image_embeds = image_embeds.reshape(B, -1, C) return image_embeds @property def dummy_feature(self): return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype) @property def dtype(self): for p in self.vision_tower.parameters(): return p.dtype @property def device(self): for p in self.vision_tower.parameters(): return p.device @property def hidden_size(self): return self.config.hidden_size @property def num_patches(self): return (self.config.image_size // self.config.patch_size) ** 2 @property def num_patches_per_side(self): return self.config.image_size // self.config.patch_size # return self.model_config["vision_cfg"]["image_size"] // self.model_config["vision_cfg"]["patch_size"] @property def image_size(self): return self.config.image_size ================================================ FILE: llava-train_videochat/llava/model/multimodal_projector/builder.py ================================================ import torch import torch.nn as nn import re from .tome16_mlp_hd64 import ToMe16_mlp_hd64 class IdentityMap(nn.Module): def __init__(self): super().__init__() def forward(self, x, *args, **kwargs): return x @property def config(self): return {"mm_projector_type": "identity"} class 
SimpleResBlock(nn.Module): def __init__(self, channels): super().__init__() self.pre_norm = nn.LayerNorm(channels) self.proj = nn.Sequential(nn.Linear(channels, channels), nn.GELU(), nn.Linear(channels, channels)) def forward(self, x): x = self.pre_norm(x) return x + self.proj(x) def build_vision_projector(config, delay_load=False, **kwargs): projector_type = getattr(config, "mm_projector_type", "linear") if projector_type == 'tome16_mlp_hd64': return ToMe16_mlp_hd64(config, kwargs["vision_cfg"]) if projector_type == "linear": return nn.Linear(config.mm_hidden_size, config.hidden_size) mlp_gelu_match = re.match(r"^mlp(\d+)x_gelu$", projector_type) if mlp_gelu_match: mlp_depth = int(mlp_gelu_match.group(1)) modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)] for _ in range(1, mlp_depth): modules.append(nn.GELU()) modules.append(nn.Linear(config.hidden_size, config.hidden_size)) return nn.Sequential(*modules) mlp_gelu_resnet_match = re.match(r"^mlp(\d+)x_res(\d+)x_gelu$", projector_type) if mlp_gelu_resnet_match: mlp_depth = int(mlp_gelu_resnet_match.group(1)) res_depth = int(mlp_gelu_resnet_match.group(2)) modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)] for _ in range(1, mlp_depth): modules.append(nn.GELU()) modules.append(nn.Linear(config.hidden_size, config.hidden_size)) for _ in range(res_depth): modules.append(SimpleResBlock(config.hidden_size)) return nn.Sequential(*modules) if projector_type == "identity": return IdentityMap() raise ValueError(f"Unknown projector type: {projector_type}") ================================================ FILE: llava-train_videochat/llava/model/multimodal_projector/tome16_mlp_hd64.py ================================================ # Copyright (c) Meta Platforms, Inc. and affiliates. # All rights reserved. # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. # -------------------------------------------------------- import torch import torch.nn as nn from typing import Callable, Tuple import torch.nn.functional as F def bipartite_soft_matching( metric: torch.Tensor, r: int, ) -> Tuple[Callable, Callable]: """ Applies ToMe with a balanced matching set (50%, 50%). Input size is [batch, tokens, channels]. r indicates the number of tokens to remove (max 50% of tokens). 
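
    Illustrative example (hypothetical shapes, not tied to the training pipeline):
    with `metric` of shape [2, 100, 64] and r=30, the returned `merge` maps a
    [2, 100, C] tensor to [2, 70, C] by folding the 30 most similar even-indexed
    tokens into their best odd-indexed match, and `unmerge` scatters a
    [2, 70, C] tensor back to [2, 100, C].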
""" protected = 0 t = metric.shape[1] r = min(r, (t - protected) // 2) assert r > 0, r with torch.no_grad(): metric = metric / metric.norm(dim=-1, keepdim=True) a, b = metric[..., ::2, :], metric[..., 1::2, :] scores = a @ b.transpose(-1, -2) node_max, node_idx = scores.max(dim=-1) edge_idx = node_max.argsort(dim=-1, descending=True)[..., None] unm_idx = edge_idx[..., r:, :] # Unmerged Tokens src_idx = edge_idx[..., :r, :] # Merged Tokens dst_idx = node_idx[..., None].gather(dim=-2, index=src_idx) def merge(x: torch.Tensor, mode="mean") -> torch.Tensor: src, dst = x[..., ::2, :], x[..., 1::2, :] n, t1, c = src.shape unm = src.gather(dim=-2, index=unm_idx.expand(n, t1 - r, c)) src = src.gather(dim=-2, index=src_idx.expand(n, r, c)) dst = dst.scatter_add(-2, dst_idx.expand(n, r, c), src) # , reduce=mode) return torch.cat([unm, dst], dim=1) def unmerge(x: torch.Tensor) -> torch.Tensor: unm_len = unm_idx.shape[1] unm, dst = x[..., :unm_len, :], x[..., unm_len:, :] n, _, c = unm.shape src = dst.gather(dim=-2, index=dst_idx.expand(n, r, c)) out = torch.zeros(n, metric.shape[1], c, device=x.device, dtype=x.dtype) out[..., 1::2, :] = dst out.scatter_(dim=-2, index=(2 * unm_idx).expand(n, unm_len, c), src=unm) out.scatter_(dim=-2, index=(2 * src_idx).expand(n, r, c), src=src) return out return merge, unmerge def merge_wavg( merge: Callable, x: torch.Tensor, size: torch.Tensor = None ) -> Tuple[torch.Tensor, torch.Tensor]: """ Applies the merge function by taking a weighted average based on token size. Returns the merged tensor and the new token sizes. """ if size is None: size = torch.ones_like(x[..., 0, None]) x = merge(x * size, mode="sum") size = merge(size, mode="sum") x = x / size return x, size class ToMe16_mlp_hd64(nn.Module): def __init__(self, config, vision_cfg): super().__init__() self._config = config self.mm_hidden_size = config.mm_hidden_size self.hw = vision_cfg.image_size // vision_cfg.patch_size self.num_attention_heads = vision_cfg.num_attention_heads self.mlp = nn.Sequential(nn.Linear(config.mm_hidden_size, config.hidden_size), nn.GELU(), nn.Linear(config.hidden_size, config.hidden_size)) self.max_pos_hw = self.hw self.max_pos_num_frames = config.mm_pos_num_frames # self._set_3d_pos_cache(max_grid_size=self.max_pos_hw, max_t_size=self.max_pos_num_frames) self.num_image_patches_per_side = 8 self.num_frame_patches_per_side = 4 def merge_tokens(self, x, target_num_token): r""" x = torch.randn(10, 2560, c) x = merge_tokens(x, r_merge_list=[1280]) """ size = None b, p, c = x.shape tmp_p = p r_merge_list = [] assert tmp_p > target_num_token, f"{tmp_p} should greater than {target_num_token}" while tmp_p != target_num_token: if tmp_p - target_num_token <= (tmp_p // 2): r_merge_list.append(tmp_p - target_num_token) break else: r_merge_list.append(tmp_p // 2) tmp_p = tmp_p - (tmp_p // 2) head = self.num_attention_heads dim = c // head for r in r_merge_list: metric = x.reshape(b, p, head, dim).mean(2) # [b, p, c//head] merge, _ = bipartite_soft_matching( metric, r ) x, size = merge_wavg(merge, x, size) _, p, _ = x.shape # x = x.reshape(-1, c) # 300, 1024 return x def forward(self, x, compress=False, local_num_frames=-1): height = width = self.hw assert height * width == x.shape[1] dtype = x.dtype device = x.device if local_num_frames != -1 and local_num_frames != 1: assert compress is True if compress: if local_num_frames != -1: num_frames = local_num_frames x = x.reshape(x.shape[0] // local_num_frames, -1, x.shape[-1]) else: num_frames = x.shape[0] x = x.reshape(1, -1, x.shape[-1]) 
            num_tome_tokens = 16 * num_frames
        else:
            num_tome_tokens = 64

        x = self.merge_tokens(x, target_num_token=num_tome_tokens)
        x = self.mlp(x)
        return x

    @property
    def config(self):
        return {"mm_projector_type": "tome16_mlp_hd64"}


================================================
FILE: llava-train_videochat/llava/model/utils.py
================================================
from transformers import AutoConfig


def auto_upgrade(config):
    cfg = AutoConfig.from_pretrained(config)
    if "llava" in config and "llava" not in cfg.model_type:
        assert cfg.model_type == "llama"
        print("You are using newer LLaVA code base, while the checkpoint of v0 is from older code base.")
        print("You must upgrade the checkpoint to the new code base (this can be done automatically).")
        confirm = input("Please confirm that you want to upgrade the checkpoint. [Y/N]")
        if confirm.lower() in ["y", "yes"]:
            print("Upgrading checkpoint...")
            assert len(cfg.architectures) == 1
            setattr(cfg.__class__, "model_type", "llava")
            cfg.architectures[0] = "LlavaLlamaForCausalLM"
            cfg.save_pretrained(config)
            print("Checkpoint upgraded.")
        else:
            print("Checkpoint upgrade aborted.")
            exit(1)


================================================
FILE: llava-train_videochat/llava/serialize_utils.py
================================================
# Description: This file contains the code for serializing the dataset.
# From https://github.com/ppwwyyxx/RAM-multiprocess-dataloader/blob/795868a37446d61412b9a58dbb1b7c76e75d39c4/serialize.py
# Copyright (c) Facebook, Inc. and its affiliates.
"""
List serialization code adopted from
https://github.com/facebookresearch/detectron2/blob/main/detectron2/data/common.py
"""
import multiprocessing as mp
from typing import List, Any, Optional
import pickle
import numpy as np
import torch
import torch.distributed as dist
import functools
import os
from datetime import timedelta


def get_world_size() -> int:
    if not dist.is_available():
        return 1
    if not dist.is_initialized():
        return 1
    return dist.get_world_size()


def get_rank() -> int:
    if not dist.is_available():
        return 0
    if not dist.is_initialized():
        return 0
    return dist.get_rank()


def get_local_rank() -> int:
    if not dist.is_available():
        return 0
    if not dist.is_initialized():
        return 0

    # this is not guaranteed to be set
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        return int(os.environ['LOCAL_RANK'])
    elif 'SLURM_PROCID' in os.environ:
        return int(os.environ['SLURM_LOCALID'])
    else:
        raise RuntimeError("Unable to get local rank")


def get_local_size() -> int:
    return torch.cuda.device_count()


@functools.lru_cache()
def _get_global_gloo_group():
    """
    Return a process group based on gloo backend, containing all the ranks
    The result is cached.
    """
    if dist.get_backend() == "nccl":
        return dist.new_group(backend="gloo", timeout=timedelta(minutes=60))
    else:
        return dist.group.WORLD


def all_gather(data, group=None):
    """
    Run all_gather on arbitrary picklable data (not necessarily tensors).

    Args:
        data: any picklable object
        group: a torch process group. By default, will use a group which
            contains all ranks on gloo backend.

    Returns:
        list[data]: list of data gathered from each rank
    """
    if get_world_size() == 1:
        return [data]
    if group is None:
        group = (
            _get_global_gloo_group()
        )  # use CPU group by default, to reduce GPU RAM usage.
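    # Descriptive note: `dist.all_gather_object` pickles arbitrary Python objects;
    # running it on a gloo (CPU) group keeps the serialized buffers in host memory
    # rather than GPU memory, which matters when gathering large dataset metadata.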
world_size = dist.get_world_size(group) if world_size == 1: return [data] output = [None for _ in range(world_size)] dist.all_gather_object(output, data, group=group) return output class NumpySerializedList: def __init__(self, lst: list): def _serialize(data): buffer = pickle.dumps(data, protocol=-1) return np.frombuffer(buffer, dtype=np.uint8) print( "Serializing {} elements to byte tensors and concatenating them all ...".format( len(lst) ) ) self._lst = [_serialize(x) for x in lst] self._addr = np.asarray([len(x) for x in self._lst], dtype=np.int64) self._addr = np.cumsum(self._addr) self._lst = np.concatenate(self._lst) print("Serialized dataset takes {:.2f} MiB".format(len(self._lst) / 1024**2)) def __len__(self): return len(self._addr) def __getitem__(self, idx): start_addr = 0 if idx == 0 else self._addr[idx - 1].item() end_addr = self._addr[idx].item() bytes = memoryview(self._lst[start_addr:end_addr]) return pickle.loads(bytes) class TorchSerializedList(NumpySerializedList): def __init__(self, lst: list): super().__init__(lst) self._addr = torch.from_numpy(self._addr) self._lst = torch.from_numpy(self._lst) def __getitem__(self, idx): start_addr = 0 if idx == 0 else self._addr[idx - 1].item() end_addr = self._addr[idx].item() bytes = memoryview(self._lst[start_addr:end_addr].numpy()) return pickle.loads(bytes) def local_scatter(array: Optional[List[Any]]): """ Scatter an array from local leader to all local workers. The i-th local worker gets array[i]. Args: array: Array with same size of #local workers. """ if get_local_size() <= 1: # Just one worker. Do nothing. return array[0] if get_local_rank() == 0: assert len(array) == get_local_size() all_gather(array) else: all_data = all_gather(None) array = all_data[get_rank() - get_local_rank()] return array[get_local_rank()] # NOTE: https://github.com/facebookresearch/mobile-vision/pull/120 # has another implementation that does not use tensors. class TorchShmSerializedList(TorchSerializedList): def __init__(self, lst: list): if get_local_rank() == 0: super().__init__(lst) if get_local_rank() == 0: # Move data to shared memory, obtain a handle to send to each local worker. # This is cheap because a tensor will only be moved to shared memory once. handles = [None] + [ bytes(mp.reduction.ForkingPickler.dumps((self._addr, self._lst))) for _ in range(get_local_size() - 1) ] else: handles = None # Each worker receives the handle from local leader. handle = local_scatter(handles) if get_local_rank() > 0: # Materialize the tensor from shared memory. self._addr, self._lst = mp.reduction.ForkingPickler.loads(handle) print( f"Worker {get_rank()} obtains a dataset of length=" f"{len(self)} from its local leader." ) # From https://github.com/ppwwyyxx/RAM-multiprocess-dataloader/issues/5#issuecomment-1510676170 def local_broadcast_process_authkey(): if int(os.environ['LOCAL_WORLD_SIZE']) == 1: return local_rank = int(os.environ['LOCAL_RANK']) authkey = bytes(mp.current_process().authkey) all_keys = all_gather(authkey) local_leader_key = all_keys[get_rank() - local_rank] if authkey != local_leader_key: print("Process authkey is different from the key of local leader. 
This might happen when " "workers are launched independently.") print("Overwriting local authkey ...") mp.current_process().authkey = local_leader_key ================================================ FILE: llava-train_videochat/llava/train/llava_trainer.py ================================================ import os import torch import torch.nn as nn import datetime from accelerate import Accelerator from accelerate.utils import InitProcessGroupKwargs, GradientAccumulationPlugin from torch.utils.data import Dataset, Sampler, DataLoader from transformers import Trainer from transformers.trainer import is_sagemaker_mp_enabled, get_parameter_names, has_length, ALL_LAYERNORM_LAYERS, logger, is_accelerate_available, is_datasets_available, GradientAccumulationPlugin from transformers.trainer_utils import seed_worker from transformers.trainer_pt_utils import get_length_grouped_indices as get_length_grouped_indices_hf from transformers.trainer_pt_utils import AcceleratorConfig from typing import List, Optional from datetime import timedelta if is_accelerate_available(): from accelerate import Accelerator, skip_first_batches, InitProcessGroupKwargs if is_datasets_available(): import datasets from llava.utils import rank0_print def maybe_zero_3(param, ignore_status=False, name=None): from deepspeed import zero from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus if hasattr(param, "ds_id"): if param.ds_status == ZeroParamStatus.NOT_AVAILABLE: if not ignore_status: print(name, "no ignore status") with zero.GatheredParameters([param]): param = param.data.detach().cpu().clone() else: param = param.detach().cpu().clone() return param def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match): to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)} to_return = {k: maybe_zero_3(v, ignore_status=True, name=k).cpu() for k, v in to_return.items()} return to_return def split_to_even_chunks(indices, lengths, num_chunks): """ Split a list of indices into `chunks` chunks of roughly equal lengths. """ if len(indices) % num_chunks != 0: return [indices[i::num_chunks] for i in range(num_chunks)] num_indices_per_chunk = len(indices) // num_chunks chunks = [[] for _ in range(num_chunks)] chunks_lengths = [0 for _ in range(num_chunks)] for index in indices: shortest_chunk = chunks_lengths.index(min(chunks_lengths)) chunks[shortest_chunk].append(index) chunks_lengths[shortest_chunk] += lengths[index] if len(chunks[shortest_chunk]) == num_indices_per_chunk: chunks_lengths[shortest_chunk] = float("inf") return chunks def get_variable_length_grouped_indices(lengths, batch_size, world_size, megabatch_mult=8, generator=None): # We need to use torch for the random part as a distributed sampler will set the random seed for torch. 
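    # Descriptive note: the scheme below sorts all indices longest-first, slices them
    # into mega-batches of world_size * batch_size * megabatch_mult, shuffles within
    # each mega-batch (using a random permutation as the sort key), then re-cuts the
    # result into per-step batches and shuffles the batch order.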
indices = torch.randperm(len(lengths), generator=generator) sorted_indices = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True) megabatch_size = world_size * batch_size * megabatch_mult megabatches = [sorted_indices[i : i + megabatch_size] for i in range(0, len(lengths), megabatch_size)] megabatches = [sorted(megabatch, key=lambda i: indices[i], reverse=True) for megabatch in megabatches] shuffled_indices = [i for megabatch in megabatches for i in megabatch] world_batch_size = world_size * batch_size batches = [shuffled_indices[i : i + world_batch_size] for i in range(0, len(lengths), world_batch_size)] batch_indices = torch.randperm(len(batches), generator=generator) batches = [batches[i] for i in batch_indices] return [i for batch in batches for i in batch] def get_modality_length_grouped_indices(lengths, batch_size, world_size, generator=None): """ Return a list of indices so that each slice of `batch_size` consecutive indices correspond to elements of similar lengths. To do this, the indices are: - randomly permuted - grouped in mega-batches of size `mega_batch_mult * batch_size` - reorder by length in each mega-batch The result is the concatenation of all mega-batches, with the batch of `batch_size` containing the element of maximum length placed first, so that an OOM happens sooner rather than later. """ # We need to use torch for the random part as a distributed sampler will set the random seed for torch. assert all(l != 0 for l in lengths), "Should not have zero length." if all(l > 0 for l in lengths) or all(l < 0 for l in lengths): # all samples are in the same modality return get_length_grouped_indices(lengths, batch_size, world_size, generator=generator) mm_indices, mm_lengths = zip(*[(i, l) for i, l in enumerate(lengths) if l > 0]) lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0]) mm_shuffle = [mm_indices[i] for i in get_length_grouped_indices(mm_lengths, batch_size, world_size, generator=None)] lang_shuffle = [lang_indices[i] for i in get_length_grouped_indices(lang_lengths, batch_size, world_size, generator=None)] megabatch_size = world_size * batch_size mm_megabatches = [mm_shuffle[i : i + megabatch_size] for i in range(0, len(mm_shuffle), megabatch_size)] lang_megabatches = [lang_shuffle[i : i + megabatch_size] for i in range(0, len(lang_shuffle), megabatch_size)] last_mm = mm_megabatches[-1] last_lang = lang_megabatches[-1] additional_batch = last_mm + last_lang megabatches = mm_megabatches[:-1] + lang_megabatches[:-1] megabatch_indices = torch.randperm(len(megabatches), generator=generator) megabatches = [megabatches[i] for i in megabatch_indices] if len(additional_batch) > 0: megabatches.append(sorted(additional_batch)) return [i for megabatch in megabatches for i in megabatch] def get_length_grouped_indices(lengths, batch_size, world_size, generator=None, merge=True): """ Return a list of indices so that each slice of `batch_size` consecutive indices correspond to elements of similar lengths. To do this, the indices are: - randomly permuted - grouped in mega-batches of size `mega_batch_mult * batch_size` - reorder by length in each mega-batch The result is the concatenation of all mega-batches, with the batch of `batch_size` containing the element of maximum length placed first, so that an OOM happens sooner rather than later. """ # We need to use torch for the random part as a distributed sampler will set the random seed for torch. 
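    # Illustrative walk-through (hypothetical numbers): with 8 samples, batch_size=2 and
    # world_size=2, megabatch_size = 4. The random permutation is cut into two mega-batches
    # of 4 indices, each mega-batch is sorted longest-first, and split_to_even_chunks then
    # distributes its indices over the 2 ranks so per-rank total length stays balanced.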
indices = torch.randperm(len(lengths), generator=generator) megabatch_size = world_size * batch_size megabatches = [indices[i : i + megabatch_size].tolist() for i in range(0, len(lengths), megabatch_size)] megabatches = [sorted(megabatch, key=lambda i: lengths[i], reverse=True) for megabatch in megabatches] megabatches = [split_to_even_chunks(megabatch, lengths, world_size) for megabatch in megabatches] return [i for megabatch in megabatches for batch in megabatch for i in batch] def get_length_grouped_indices_auto_single(lengths, batch_size, world_size, generator=None): indices = get_length_grouped_indices_hf(lengths, batch_size * world_size, generator=generator) megabatch_size = world_size * batch_size megabatches = [indices[i : i + megabatch_size] for i in range(0, len(lengths), megabatch_size)] megabatches = [sorted(megabatch, key=lambda i: lengths[i], reverse=True) for megabatch in megabatches] megabatches = [split_to_even_chunks(megabatch, lengths, world_size) for megabatch in megabatches] # We need to use torch for the random part as a distributed sampler will set the random seed for torch. batch_indices = torch.randperm(len(megabatches), generator=generator) megabatches = [megabatches[i] for i in batch_indices] return [i for megabatch in megabatches for batch in megabatch for i in batch] def get_modality_length_grouped_indices_auto(lengths, batch_size, world_size, generator=None): # We need to use torch for the random part as a distributed sampler will set the random seed for torch. assert all(l != 0 for l in lengths), "Should not have zero length." if all(l > 0 for l in lengths) or all(l < 0 for l in lengths): # all samples are in the same modality return get_length_grouped_indices_auto_single(lengths, batch_size, world_size, generator=generator) mm_indices, mm_lengths = zip(*[(i, l) for i, l in enumerate(lengths) if l > 0]) lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0]) mm_shuffle = [mm_indices[i] for i in get_length_grouped_indices_auto_single(mm_lengths, batch_size, world_size, generator=None)] lang_shuffle = [lang_indices[i] for i in get_length_grouped_indices_auto_single(lang_lengths, batch_size, world_size, generator=None)] megabatch_size = world_size * batch_size mm_megabatches = [mm_shuffle[i : i + megabatch_size] for i in range(0, len(mm_shuffle), megabatch_size)] lang_megabatches = [lang_shuffle[i : i + megabatch_size] for i in range(0, len(lang_shuffle), megabatch_size)] last_mm = mm_megabatches[-1] last_lang = lang_megabatches[-1] additional_batch = last_mm + last_lang megabatches = mm_megabatches[:-1] + lang_megabatches[:-1] megabatch_indices = torch.randperm(len(megabatches), generator=generator) megabatches = [megabatches[i] for i in megabatch_indices] # FIXME: Hard code to avoid last batch mixed with different modalities # if len(additional_batch) > 0: # megabatches.append(sorted(additional_batch)) return [i for megabatch in megabatches for i in megabatch] class LengthGroupedSampler(Sampler): r""" Sampler that samples indices in a way that groups together features of the dataset of roughly the same length while keeping a bit of randomness. 
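
    Illustrative usage (a sketch; real `lengths` come from the dataset and the
    flags mirror the training args):

        sampler = LengthGroupedSampler(batch_size=4, world_size=2, lengths=[512, 128, 1024, 77])
        order = list(iter(sampler))  # same-batch samples end up with similar lengths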
""" def __init__( self, batch_size: int, world_size: int, lengths: Optional[List[int]] = None, generator=None, variable_length: bool = False, group_by_modality: bool = False, group_by_modality_auto: bool = False, ): if lengths is None: raise ValueError("Lengths must be provided.") self.batch_size = batch_size self.world_size = world_size self.lengths = lengths self.generator = generator self.variable_length = variable_length self.group_by_modality = group_by_modality self.group_by_modality_auto = group_by_modality_auto def __len__(self): return len(self.lengths) def __iter__(self): if self.variable_length: assert not self.group_by_modality, "Variable length grouping is not supported with modality grouping." indices = get_variable_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator) else: if self.group_by_modality: indices = get_modality_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator) elif self.group_by_modality_auto: indices = get_modality_length_grouped_indices_auto(self.lengths, self.batch_size, self.world_size, generator=self.generator) else: indices = get_length_grouped_indices_auto_single(self.lengths, self.batch_size, self.world_size, generator=self.generator) return iter(indices) class LLaVATrainer(Trainer): def create_accelerator_and_postprocess(self): grad_acc_kwargs = {"num_steps": self.args.gradient_accumulation_steps} grad_acc_kwargs["sync_with_dataloader"] = False gradient_accumulation_plugin = GradientAccumulationPlugin(**grad_acc_kwargs) accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) rank0_print("Setting NCCL timeout to INF to avoid running errors.") # create accelerator object self.accelerator = Accelerator( dispatch_batches=self.args.dispatch_batches, split_batches=self.args.split_batches, deepspeed_plugin=self.args.deepspeed_plugin, gradient_accumulation_plugin=gradient_accumulation_plugin, kwargs_handlers=[accelerator_kwargs] ) # some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag self.gather_function = self.accelerator.gather_for_metrics # deepspeed and accelerate flags covering both trainer args and accelerate launcher self.is_deepspeed_enabled = getattr(self.accelerator.state, "deepspeed_plugin", None) is not None self.is_fsdp_enabled = getattr(self.accelerator.state, "fsdp_plugin", None) is not None # post accelerator creation setup if self.is_fsdp_enabled: fsdp_plugin = self.accelerator.state.fsdp_plugin fsdp_plugin.limit_all_gathers = self.args.fsdp_config.get("limit_all_gathers", fsdp_plugin.limit_all_gathers) if is_accelerate_available("0.23.0"): fsdp_plugin.activation_checkpointing = self.args.fsdp_config.get("activation_checkpointing", fsdp_plugin.activation_checkpointing) if fsdp_plugin.activation_checkpointing and self.args.gradient_checkpointing: raise ValueError("The activation_checkpointing in FSDP config and the gradient_checkpointing in training arg " "can't be set to True simultaneously. 
Please use FSDP's activation_checkpointing logic " "when using FSDP.") if self.is_deepspeed_enabled and getattr(self.args, "hf_deepspeed_config", None) is None: self.propagate_args_to_deepspeed() def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]: if self.train_dataset is None or not has_length(self.train_dataset): return None if self.args.group_by_length: lengths = self.train_dataset.lengths return LengthGroupedSampler( # self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps self.args.train_batch_size, # world_size=self.args.world_size, world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work? lengths=lengths, ) elif self.args.group_by_modality_length: lengths = self.train_dataset.modality_lengths return LengthGroupedSampler( # self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps self.args.train_batch_size, # world_size=self.args.world_size, world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work? lengths=lengths, group_by_modality=True, ) elif self.args.group_by_modality_length_auto: lengths = self.train_dataset.modality_lengths return LengthGroupedSampler( # self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps self.args.train_batch_size, # world_size=self.args.world_size, world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work? lengths=lengths, group_by_modality_auto=True, ) elif self.args.group_by_varlen: lengths = self.train_dataset.lengths return LengthGroupedSampler( self.args.train_batch_size * self.args.gradient_accumulation_steps, # self.args.train_batch_size, # TODO: seems that we should have gradient_accumulation_steps # world_size=self.args.world_size, world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work? lengths=lengths, variable_length=True, ) else: return super()._get_train_sampler() def get_train_dataloader(self) -> DataLoader: """ Returns the training [`~torch.utils.data.DataLoader`]. Will use no sampler if `train_dataset` does not implement `__len__`, a random sampler (adapted to distributed training if necessary) otherwise. Subclass and override this method if you want to inject some custom behavior. 
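
        Note: `prefetch_factor` is set to `2 * dataloader_num_workers` when workers
        are enabled, and the resulting dataloader is wrapped by `accelerator.prepare`.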
""" if self.train_dataset is None: raise ValueError("Trainer: training requires a train_dataset.") train_dataset = self.train_dataset data_collator = self.data_collator if is_datasets_available() and isinstance(train_dataset, datasets.Dataset): train_dataset = self._remove_unused_columns(train_dataset, description="training") else: data_collator = self._get_collator_with_removed_columns(data_collator, description="training") dataloader_params = { "batch_size": self._train_batch_size, "collate_fn": data_collator, "num_workers": self.args.dataloader_num_workers, "pin_memory": self.args.dataloader_pin_memory, "persistent_workers": self.args.dataloader_persistent_workers, } if not isinstance(train_dataset, torch.utils.data.IterableDataset): dataloader_params["sampler"] = self._get_train_sampler() dataloader_params["drop_last"] = self.args.dataloader_drop_last dataloader_params["worker_init_fn"] = seed_worker dataloader_params["prefetch_factor"] = self.args.dataloader_num_workers * 2 if self.args.dataloader_num_workers != 0 else None dataloader = self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params)) return dataloader def create_optimizer(self): """ Setup the optimizer. We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer's init through `optimizers`, or subclass and override this method in a subclass. """ if is_sagemaker_mp_enabled(): return super().create_optimizer() opt_model = self.model if self.optimizer is None: decay_parameters = get_parameter_names(opt_model, ALL_LAYERNORM_LAYERS) decay_parameters = [name for name in decay_parameters if "bias" not in name] lr_mapper = {} if self.args.mm_projector_lr is not None: lr_mapper["mm_projector"] = self.args.mm_projector_lr if self.args.mm_vision_tower_lr is not None: lr_mapper["vision_tower"] = self.args.mm_vision_tower_lr if len(lr_mapper) > 0: special_lr_parameters = [name for name, _ in opt_model.named_parameters() if any(module_keyword in name for module_keyword in lr_mapper)] optimizer_grouped_parameters = [ { "params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and n not in special_lr_parameters and p.requires_grad)], "weight_decay": self.args.weight_decay, }, { "params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and n not in special_lr_parameters and p.requires_grad)], "weight_decay": 0.0, }, ] for module_keyword, lr in lr_mapper.items(): module_parameters = [name for name, _ in opt_model.named_parameters() if module_keyword in name] optimizer_grouped_parameters.extend( [ { "params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and n in module_parameters and p.requires_grad)], "weight_decay": self.args.weight_decay, "lr": lr, }, { "params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and n in module_parameters and p.requires_grad)], "weight_decay": 0.0, "lr": lr, }, ] ) else: optimizer_grouped_parameters = [ { "params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and p.requires_grad)], "weight_decay": self.args.weight_decay, }, { "params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and p.requires_grad)], "weight_decay": 0.0, }, ] optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(self.args) self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs) if optimizer_cls.__name__ == "Adam8bit": import bitsandbytes manager = 
bitsandbytes.optim.GlobalOptimManager.get_instance() skipped = 0 for module in opt_model.modules(): if isinstance(module, nn.Embedding): skipped += sum({p.data_ptr(): p.numel() for p in module.parameters()}.values()) logger.info(f"skipped {module}: {skipped/2**20}M params") manager.register_module_override(module, "weight", {"optim_bits": 32}) logger.debug(f"bitsandbytes: will optimize {module} in fp32") logger.info(f"skipped: {skipped/2**20}M params") return self.optimizer def _save_checkpoint(self, model, trial, metrics=None): if getattr(self.args, "tune_mm_mlp_adapter", False) or ( hasattr(self.args, "mm_tunable_parts") and (len(self.args.mm_tunable_parts.split(",")) == 1 and ("mm_mlp_adapter" in self.args.mm_tunable_parts or "mm_vision_resampler" in self.args.mm_tunable_parts)) ): from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}" run_dir = self._get_output_dir(trial=trial) output_dir = os.path.join(run_dir, checkpoint_folder) # Only save Adapter keys_to_match = ["mm_projector", "vision_resampler"] if getattr(self.args, "use_im_start_end", False): keys_to_match.extend(["embed_tokens", "embed_in"]) weight_to_save = get_mm_adapter_state_maybe_zero_3(self.model.named_parameters(), keys_to_match) if self.args.local_rank == 0 or self.args.local_rank == -1: self.model.config.save_pretrained(output_dir) torch.save(weight_to_save, os.path.join(output_dir, f"mm_projector.bin")) else: super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics) def _save(self, output_dir: Optional[str] = None, state_dict=None): if getattr(self.args, "tune_mm_mlp_adapter", False): pass else: super(LLaVATrainer, self)._save(output_dir, state_dict) ================================================ FILE: llava-train_videochat/llava/train/llava_trainer_eval.py ================================================ import json import subprocess from llava.train.llava_trainer import LLaVATrainer class LLaVAEvalTrainer(LLaVATrainer): def evaluate(self, evaluate_args): cmd = f"accelerate launch --num_processes {evaluate_args.eval_num_processes} -m lmms_eval \ --model {evaluate_args.model} \ --model_args {evaluate_args.model_args} \ --tasks {evaluate_args.task_names} \ --batch_size {evaluate_args.batch_size} \ --log_samples_suffix {evaluate_args.log_samples_suffix} \ --output_path {evaluate_args.output_path}" if evaluate_args.limit: cmd += f" --limit {evaluate_args.limit}" if evaluate_args.num_fewshot: cmd += f" --num_fewshot {evaluate_args.num_fewshot}" if evaluate_args.gen_kwargs != "": cmd += f" --gen_kwargs {evaluate_args.gen_kwargs}" if evaluate_args.log_samples: cmd += f" --log_samples" else: assert False, "Please log samples so that the result can be parsed" results = subprocess.run([cmd], shell=True, capture_output=True, text=True) try: result_file_index_start = results.stdout.index("Saved samples to ") result_file_index_end = results.stdout.index(f".json") result_file_index_start += len("Saved samples to ") file = results.stdout[result_file_index_start:result_file_index_end] except: result_file_index_start = results.stderr.index("Saved samples to ") result_file_index_end = results.stderr.index(f".json") result_file_index_start += len("Saved samples to ") file = results.stderr[result_file_index_start:result_file_index_end] file = file.split("/")[:-1] file = "/".join(file) + "/results.json" with open(file, "r") as f: lmms_eval_results = json.load(f) result_dict = {} tasks_list = evaluate_args.task_names.split(",") for task in 
tasks_list: task_results = lmms_eval_results["results"][task] for k, v in task_results.items(): if k != "alias" and "stderr" not in k: metric = k.split(",")[0] result_dict[f"{task}_{metric}"] = v return result_dict """def evaluate(self, evaluate_args): initialize_tasks() tasks_list = evaluate_args.task_names.split(",") result_dict = {} results = evaluator.simple_evaluate( model=evaluate_args.model, model_args=evaluate_args.model_args, tasks=tasks_list, num_fewshot=evaluate_args.num_fewshot, batch_size=evaluate_args.batch_size, device=evaluate_args.device, limit=evaluate_args.limit, check_integrity=evaluate_args.check_integrity, show_task_to_terminal=evaluate_args.show_task_to_terminal, log_samples=evaluate_args.log_samples, gen_kwargs=evaluate_args.gen_kwargs, cli_args=evaluate_args, ) for task in tasks_list: task_results = results["results"][task] for k,v in task_results.items(): if k != "alias" and "stderr" not in k: metric = k.split(",")[0] result_dict[f"{task}_{metric}"] = v return result_dict""" ================================================ FILE: llava-train_videochat/llava/train/train.py ================================================ # Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright: # Adopted from tatsu-lab@stanford_alpaca. Below is the original copyright: # Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. import ast import os import copy from dataclasses import dataclass, field import json import logging import pathlib from typing import Dict, Optional, Sequence, List from PIL import Image, ImageFile from packaging import version import numpy as np import gc import io import time import random import yaml import math import re import torch import transformers import tokenizers import deepspeed from transformers import AutoConfig from torch.utils.data import Dataset from llava.constants import IGNORE_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IMAGE_TOKEN_INDEX from llava.train.llava_trainer import LLaVATrainer from llava import conversation as conversation_lib from llava.model import * from llava.mm_utils import process_highres_image, process_anyres_image, process_anyres_image_nopad, process_highres_image_crop_split, tokenizer_image_token, process_anyres_video_nopad from llava.utils import rank0_print from llava.video_utils import VIDEO_READER_FUNCS # from llava.serialize_utils import TorchShmSerializedList, get_rank, get_local_rank, local_broadcast_process_authkey # import wandb torch.multiprocessing.set_sharing_strategy("file_system") ImageFile.LOAD_TRUNCATED_IMAGES = True local_rank = None IS_TOKENIZER_GREATER_THAN_0_14 = version.parse(tokenizers.__version__) >= version.parse("0.14") @dataclass class ModelArguments: model_name_or_path: Optional[str] = field(default="facebook/opt-125m") model_class_name: Optional[str] = field(default=None, metadata={"help": "Used to init model class, format is XXXXForCausalLM. e.g. 
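# --- Editor's note: illustrative sketch, not part of the original file. ---
# LLaVAEvalTrainer.evaluate() above flattens lmms-eval's results.json into a
# {f"{task}_{metric}": value} dict, skipping "alias" and stderr entries. A
# standalone version of that parsing step, assuming the results.json layout
# implied by the code above (function name is hypothetical):
#
#   import json
#
#   def flatten_lmms_eval_results(results_path: str, task_names: str) -> dict:
#       with open(results_path, "r") as f:
#           lmms_eval_results = json.load(f)
#       result_dict = {}
#       for task in task_names.split(","):
#           for k, v in lmms_eval_results["results"][task].items():
#               # keys look like "accuracy,none"; keep the metric name only
#               if k != "alias" and "stderr" not in k:
#                   result_dict[f"{task}_{k.split(',')[0]}"] = v
#       return result_dict
#
#   # e.g. flatten_lmms_eval_results("./logs/results.json", "mvbench,mlvu")
#   # -> {"mvbench_accuracy": ..., "mlvu_accuracy": ...}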
@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
    model_class_name: Optional[str] = field(default=None, metadata={"help": "Used to init model class, format is XXXXForCausalLM. e.g. currently XXXX is chosen from LlavaLlama, LlavaMixtral, LlavaMistral, Llama"})

    mm_tunable_parts: Optional[str] = field(
        default=None, metadata={"help": 'Could be "mm_mlp_adapter", "mm_vision_resampler", "mm_vision_tower,mm_mlp_adapter,mm_language_model", "mm_mlp_adapter,mm_language_model"'}
    )  # deciding which part of the multimodal model to tune, will overwrite other previous settings

    version: Optional[str] = field(default="v0")
    freeze_backbone: bool = field(default=False)
    tune_mm_mlp_adapter: bool = field(default=False)
    tune_mm_vision_resampler: bool = field(default=False)
    vision_tower: Optional[str] = field(default=None)
    vision_tower_pretrained: Optional[str] = field(default=None)
    vision_encode_type: Optional[str] = field(default="image")
    unfreeze_mm_vision_tower: bool = field(default=False)
    unfreeze_language_model: bool = field(default=False)
    mm_vision_select_layer: Optional[int] = field(default=-1)  # default to the last layer
    pretrain_mm_mlp_adapter: Optional[str] = field(default=None)
    mm_projector_type: Optional[str] = field(default="linear")
    mm_use_im_start_end: bool = field(default=False)
    mm_use_im_patch_token: bool = field(default=True)
    mm_patch_merge_type: Optional[str] = field(default="flat")
    mm_vision_select_feature: Optional[str] = field(default="patch")
    mm_resampler_type: Optional[str] = field(default=None)
    mm_mask_drop_mode: str = field(default="fixed")
    mm_mask_drop_skip_percentage: float = field(default=0.0)
    mm_mask_drop_ratio: float = field(default=0.25)
    mm_mask_drop_ratio_upper: Optional[float] = field(default=None)
    mm_mask_drop_ratio_lower: Optional[float] = field(default=None)
    mm_spatial_pool_stride: Optional[int] = field(default=None)
    mm_spatial_pool_mode: str = field(default="bilinear")
    mm_spatial_pool_out_channels: Optional[int] = field(default=None)
    mm_num_compress_latents: Optional[int] = field(default=128)
    mm_num_compress_query_type: Optional[str] = field(default='learnable')
    mm_pos_num_frames: Optional[int] = field(default=8)
    mm_close_init: Optional[bool] = field(default=False)
    min_slow_num_frames: Optional[int] = field(default=4)
    mm_perceiver_depth: Optional[int] = field(default=3)
    mm_perceiver_latents: Optional[int] = field(default=32)
    mm_perceiver_ff_mult: Optional[float] = field(default=4)
    mm_perceiver_pretrained: Optional[str] = field(default=None)
    mm_qformer_depth: Optional[int] = field(default=3)
    mm_qformer_latents: Optional[int] = field(default=32)
    mm_qformer_pretrained: Optional[str] = field(default=None)
    rope_scaling_factor: Optional[float] = field(default=None)
    rope_scaling_type: Optional[str] = field(default=None)
    s2: Optional[bool] = field(default=False)
    s2_scales: Optional[str] = field(default="336,672,1008")
    use_pos_skipping: Optional[bool] = field(default=False)
    pos_skipping_range: Optional[int] = field(default=4096)
    mm_newline_position: Optional[str] = field(default="one_token")  # for frame separate
    mm_local_num_frames: Optional[int] = field(default=-1)  # controls whether the video encoder and projector process the temporal sequence in chunks
    mm_llm_compress: Optional[bool] = field(default=False)
    llm_compress_type: Optional[str] = field(default="attention")
    llm_compress_layer_list: Optional[str] = field(default="8,16,24")
    llm_image_token_ratio_list: Optional[str] = field(default="1.0,0.5,0.25,0.125")
    # When adding new model arguments, remember to register them in overwrite_config below.


@dataclass
class DataArguments:
    data_path: str = field(default=None, metadata={"help": "Path to the training data, in llava's instruction.json format. Supporting multiple json files via /path/to/{a,b,c}.json"})
    lazy_preprocess: bool = False
    is_multimodal: bool = False
    early_mix_text: bool = False
    # image_folder: Optional[str] = field(default=None)
    image_aspect_ratio: str = "square"
    image_grid_pinpoints: Optional[str] = field(default=None)
    image_crop_resolution: Optional[int] = field(default=None)  # appears to be unused
    image_split_resolution: Optional[int] = field(default=None)  # appears to be unused
    frame_aspect_ratio: str = "square"
    frame_grid_pinpoints: Optional[str] = field(default=None)
    max_num_pixels: int = 14745600000  # 384*384*100000
    # video_folder: Optional[str] = field(default=None)
    # video_fps: Optional[int] = field(default=1)
    frames_upbound: Optional[int] = field(default=8)
    frames_lowbound: Optional[int] = field(default=1)  # note: if the video really has fewer frames than this, the count will still fall below the lower bound
    time_msg: Optional[str] = field(default=None)
    local_num_frames: Optional[int] = field(default=8)
    sample_type: Optional[str] = field(default='middle')


@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    remove_unused_columns: bool = field(default=False)
    freeze_mm_mlp_adapter: bool = field(default=False)
    freeze_mm_vision_resampler: bool = field(default=False)
    mpt_attn_impl: Optional[str] = field(default="triton")
    model_max_length: int = field(
        default=4096,
        metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
    )
    double_quant: bool = field(default=True, metadata={"help": "Compress the quantization statistics through double quantization."})
    quant_type: str = field(default="nf4", metadata={"help": "Quantization data type to use. Should be one of `fp4` or `nf4`."})
    bits: int = field(default=16, metadata={"help": "How many bits to use."})
    lora_enable: bool = False
    lora_r: int = 64
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    lora_weight_path: str = ""
    lora_bias: str = "none"
    mm_projector_lr: Optional[float] = None
    mm_vision_tower_lr: Optional[float] = None
    group_by_varlen: bool = field(default=False)
    group_by_modality_length: bool = field(default=False)
    group_by_modality_length_auto: bool = field(default=False)
    auto_find_batch_size: bool = field(default=False)
    gradient_checkpointing: bool = field(default=True)
    verbose_logging: bool = field(default=True)
    attn_implementation: str = field(default="flash_attention_2", metadata={"help": "Use transformers attention implementation."})


# @dataclass
# class EvaluationArguments:
#     eval_num_processes: int = field(default=1)
#     task_names: str = field(default=None)
#     model: str = field(default="llava")
#     model_args: Optional[str] = field(default=None)
#     num_fewshot: Optional[int] = field(default=None)
#     batch_size: int = field(default=1)
#     device: Optional[str] = field(default=None)
#     limit: Optional[int] = field(default=None)
#     check_integrity: Optional[bool] = field(default=False)
#     show_task_to_terminal: Optional[bool] = field(default=False)
#     log_samples: Optional[bool] = field(default=True)
#     gen_kwargs: Optional[str] = field(default="")
#     log_samples_suffix: Optional[str] = field(default="")
#     output_path: Optional[str] = field(default="./logs/")
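# --- Editor's note: illustrative sketch, not part of the original file. ---
# `mm_projector_lr` / `mm_vision_tower_lr` above feed LLaVATrainer.create_optimizer(),
# which splits parameters into groups so those submodules get their own learning
# rate while everything else uses the base LR. A minimal, self-contained version
# of the grouping idea (the toy module names and LR values are hypothetical):
#
#   import torch
#   import torch.nn as nn
#
#   model = nn.ModuleDict({"mm_projector": nn.Linear(4, 4), "lm_head": nn.Linear(4, 4)})
#   lr_mapper = {"mm_projector": 1e-5}  # submodule keyword -> special LR
#   base, special = [], []
#   for name, p in model.named_parameters():
#       (special if any(k in name for k in lr_mapper) else base).append(p)
#   optimizer = torch.optim.AdamW(
#       [{"params": base, "lr": 1e-4},
#        {"params": special, "lr": lr_mapper["mm_projector"]}]
#   )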
def maybe_zero_3(param, ignore_status=False, name=None):
    from deepspeed import zero
    from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus

    if hasattr(param, "ds_id"):
        if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
            if not ignore_status:
                logging.warning(f"{name}: param.ds_status != ZeroParamStatus.NOT_AVAILABLE: {param.ds_status}")
        with zero.GatheredParameters([param]):
            param = param.data.detach().cpu().clone()
    else:
        param = param.detach().cpu().clone()
    return param


# Borrowed from peft.utils.get_peft_model_state_dict
def get_peft_state_maybe_zero_3(named_params, bias):
    if bias == "none":
        to_return = {k: t for k, t in named_params if "lora_" in k}
    elif bias == "all":
        to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
    elif bias == "lora_only":
        to_return = {}
        maybe_lora_bias = {}
        lora_bias_names = set()
        for k, t in named_params:
            if "lora_" in k:
                to_return[k] = t
                bias_name = k.split("lora_")[0] + "bias"
                lora_bias_names.add(bias_name)
            elif "bias" in k:
                maybe_lora_bias[k] = t
        # keep only the bias tensors that belong to a LoRA-wrapped module
        for k, t in maybe_lora_bias.items():
            if k in lora_bias_names:
                to_return[k] = t
    else:
        raise NotImplementedError
    to_return = {k: maybe_zero_3(v, ignore_status=True) for k, v in to_return.items()}
    return to_return


def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only=True):
    to_return = {k: t for k, t in named_params if "lora_" not in k}
    if require_grad_only:
        to_return = {k: t for k, t in to_return.items() if t.requires_grad}
    to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
    return to_return


def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
    to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
    to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
    return to_return


def find_all_linear_names(model):
    cls = torch.nn.Linear
    lora_module_names = set()
    multimodal_keywords = ["mm_projector", "vision_tower", "vision_resampler"]
    for name, module in model.named_modules():
        if any(mm_keyword in name for mm_keyword in multimodal_keywords):
            continue
        if isinstance(module, cls):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)


def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
    """Collects the state dict and dump to disk."""
    if hasattr(trainer.args, "tune_mm_mlp_adapter") and trainer.args.tune_mm_mlp_adapter:
        check_only_save_mm_adapter_tunable = True
    # only has mm_mlp_adapter and mm_vision_resampler in the tunable parts
    elif hasattr(trainer.args, "mm_tunable_parts") and (len(trainer.args.mm_tunable_parts.split(",")) == 1 and ("mm_mlp_adapter" in trainer.args.mm_tunable_parts or "mm_vision_resampler" in trainer.args.mm_tunable_parts)):
        check_only_save_mm_adapter_tunable = True
    else:
        check_only_save_mm_adapter_tunable = False

    trainer.accelerator.wait_for_everyone()
    torch.cuda.synchronize()
    rank0_print(f"Only save projectors: {check_only_save_mm_adapter_tunable}")

    if check_only_save_mm_adapter_tunable:
        # Only save Adapter
        keys_to_match = ["mm_projector", "vision_resampler"]
        if getattr(trainer.args, "use_im_start_end", False):
            keys_to_match.extend(["embed_tokens", "embed_in"])

        weight_to_save = get_mm_adapter_state_maybe_zero_3(trainer.model.named_parameters(), keys_to_match)
        trainer.model.config.save_pretrained(output_dir)

        current_folder = output_dir.split("/")[-1]
        parent_folder = os.path.dirname(output_dir)
        if trainer.args.local_rank == 0 or trainer.args.local_rank == -1:
            if current_folder.startswith("checkpoint-"):
                mm_projector_folder = os.path.join(parent_folder, "mm_projector")
                os.makedirs(mm_projector_folder, exist_ok=True)
                torch.save(weight_to_save, os.path.join(mm_projector_folder,
f"{current_folder}.bin")) else: torch.save(weight_to_save, os.path.join(output_dir, f"mm_projector.bin")) return if trainer.deepspeed: trainer.save_model(output_dir) return state_dict = trainer.model.state_dict() if trainer.args.should_save: cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()} del state_dict trainer._save(output_dir, state_dict=cpu_state_dict) # noqa def smart_tokenizer_and_embedding_resize( special_tokens_dict: Dict, tokenizer: transformers.PreTrainedTokenizer, model: transformers.PreTrainedModel, ): """Resize tokenizer and embedding. Note: This is the unoptimized version that may make your embedding size not be divisible by 64. """ num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict) model.resize_token_embeddings(len(tokenizer)) if num_new_tokens > 0: input_embeddings = model.get_input_embeddings().weight.data output_embeddings = model.get_output_embeddings().weight.data input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True) output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True) input_embeddings[-num_new_tokens:] = input_embeddings_avg output_embeddings[-num_new_tokens:] = output_embeddings_avg def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict: """Tokenize a list of strings.""" tokenized_list = [ tokenizer( text, return_tensors="pt", padding="longest", max_length=tokenizer.model_max_length, truncation=True, ) for text in strings ] input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list] input_ids_lens = labels_lens = [tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list] return dict( input_ids=input_ids, labels=labels, input_ids_lens=input_ids_lens, labels_lens=labels_lens, ) def _mask_targets(target, tokenized_lens, speakers): # cur_idx = 0 cur_idx = tokenized_lens[0] tokenized_lens = tokenized_lens[1:] target[:cur_idx] = IGNORE_INDEX for tokenized_len, speaker in zip(tokenized_lens, speakers): if speaker == "human": target[cur_idx + 2 : cur_idx + tokenized_len] = IGNORE_INDEX cur_idx += tokenized_len def _add_speaker_and_signal(header, source, get_conversation=True): """Add speaker and start/end signal on each round.""" BEGIN_SIGNAL = "### " END_SIGNAL = "\n" conversation = header for sentence in source: from_str = sentence["from"] if from_str.lower() == "human": from_str = conversation_lib.default_conversation.roles[0] elif from_str.lower() == "gpt": from_str = conversation_lib.default_conversation.roles[1] else: from_str = "unknown" sentence["value"] = BEGIN_SIGNAL + from_str + ": " + sentence["value"] + END_SIGNAL if get_conversation: conversation += sentence["value"] conversation += BEGIN_SIGNAL return conversation def preprocess_multimodal(sources: Sequence[str], data_args: DataArguments, msg="") -> Dict: is_multimodal = data_args.is_multimodal if not is_multimodal: return sources for source in sources: for sentence in source: # TODO maybe this should be changed for interleaved data? 
# if DEFAULT_IMAGE_TOKEN in sentence["value"] and not sentence["value"].startswith(DEFAULT_IMAGE_TOKEN): # only check for num_im=1 num_im = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence["value"])) if num_im == 1 and DEFAULT_IMAGE_TOKEN in sentence["value"] and not sentence["value"].startswith(DEFAULT_IMAGE_TOKEN): sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "").strip() sentence["value"] = DEFAULT_IMAGE_TOKEN + "\n" + sentence["value"] sentence["value"] = sentence["value"].strip() if "mmtag" in conversation_lib.default_conversation.version: sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "" + DEFAULT_IMAGE_TOKEN + "") replace_token = DEFAULT_IMAGE_TOKEN if data_args.mm_use_im_start_end: replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN if msg.rstrip() != "": replace_token = replace_token + msg.rstrip() + " " # NOTE for time msg of video sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token) # For videoInstruct-100k noisy_data. TODO: Ask Yuanhan to clean the data instead of leaving the noise code here. sentence["value"] = sentence["value"].replace("QA_GT_caption_based_noisy", "") return sources def preprocess_llama_2(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict: conv = conversation_lib.default_conversation.copy() roles = {"human": conv.roles[0], "gpt": conv.roles[1]} # Apply prompt templates conversations = [] for i, source in enumerate(sources): if roles[source[0]["from"]] != conv.roles[0]: # Skip the first one if it is not from human source = source[1:] conv.messages = [] for j, sentence in enumerate(source): role = roles[sentence["from"]] assert role == conv.roles[j % 2], f"{i}" conv.append_message(role, sentence["value"]) conversations.append(conv.get_prompt()) # Tokenize conversations if has_image: input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0) else: input_ids = tokenizer( conversations, return_tensors="pt", padding="longest", max_length=tokenizer.model_max_length, truncation=True, ).input_ids targets = input_ids.clone() assert conv.sep_style == conversation_lib.SeparatorStyle.LLAMA_2 # Mask targets sep = "[/INST] " for conversation, target in zip(conversations, targets): total_len = int(target.ne(tokenizer.pad_token_id).sum()) rounds = conversation.split(conv.sep2) cur_len = 1 target[:cur_len] = IGNORE_INDEX for i, rou in enumerate(rounds): if rou == "": break parts = rou.split(sep) if len(parts) != 2: break parts[0] += sep if has_image: round_len = len(tokenizer_image_token(rou, tokenizer)) instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2 else: round_len = len(tokenizer(rou).input_ids) instruction_len = len(tokenizer(parts[0]).input_ids) - 2 target[cur_len : cur_len + instruction_len] = IGNORE_INDEX cur_len += round_len target[cur_len:] = IGNORE_INDEX if cur_len < tokenizer.model_max_length: if cur_len != total_len: target[:] = IGNORE_INDEX print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." 
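# --- Editor's note: illustrative sketch, not part of the original file. ---
# The has_image branches above rely on tokenizer_image_token() (llava.mm_utils,
# not shown here) to splice IMAGE_TOKEN_INDEX into the id sequence wherever the
# "<image>" placeholder occurs. A simplified stand-in showing the splicing idea
# (the real helper also handles BOS tokens and return_tensors):
#
#   def naive_tokenizer_image_token(prompt, tokenizer, image_token_index=-200):
#       chunks = [tokenizer(c).input_ids for c in prompt.split("<image>")]
#       ids = chunks[0]
#       for chunk in chunks[1:]:
#           ids = ids + [image_token_index] + chunk
#       return ids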
f" (ignored)") return dict( input_ids=input_ids, labels=targets, ) def preprocess_gemma(sources: List[List[Dict[str, str]]], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict: conv: conversation_lib.Conversation = conversation_lib.default_conversation.copy() roles: Dict[str, str] = {"human": conv.roles[0], "gpt": conv.roles[1]} # Apply prompt templates conversations: List[str] = [] for i, source in enumerate(sources): if roles[source[0]["from"]] != conv.roles[0]: # Skip the first one if it is not from human source: List[Dict[str, str]] = source[1:] conv.messages = [] for j, sentence in enumerate(source): role: str = roles[sentence["from"]] assert role == conv.roles[j % 2], f"{i}" conv.append_message(role, sentence["value"]) conversations.append(conv.get_prompt()) # Tokenize conversations if has_image: input_ids: torch.Tensor = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0) else: input_ids: torch.Tensor = tokenizer( conversations, return_tensors="pt", padding="longest", max_length=tokenizer.model_max_length, truncation=True, ).input_ids targets: torch.Tensor = input_ids.clone() assert conv.sep_style == conversation_lib.SeparatorStyle.GEMMA # Mask target sep: str = conv.sep + conv.roles[1] for conversation, target in zip(conversations, targets): total_len: int = int(target.ne(tokenizer.pad_token_id).sum()) rounds: List[str] = conversation.split(conv.sep) re_rounds = [] for conv_idx in range(0, len(rounds), 2): re_rounds.append(conv.sep.join(rounds[conv_idx : conv_idx + 2])) cur_len = 1 # Ignore target[:cur_len] = IGNORE_INDEX for i, rou in enumerate(re_rounds): if rou == "": break parts = rou.split(sep) if len(parts) != 2: break parts[0] += sep # Re-append sep because split on this # Now "".join(parts)==rou if has_image: round_len = len(tokenizer_image_token(rou, tokenizer)) - 1 # Ignore instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1 # Ignore else: round_len = len(tokenizer(rou).input_ids) - 1 # Ignore instruction_len = len(tokenizer(parts[0]).input_ids) - 1 # Ignore round_len += 2 # sep: \n takes 2 tokens target[cur_len : cur_len + instruction_len] = IGNORE_INDEX cur_len += round_len target[cur_len:] = IGNORE_INDEX if cur_len < tokenizer.model_max_length: if cur_len != total_len: target[:] = IGNORE_INDEX print(f"warning: tokenization mismatch: {cur_len} vs. {total_len}." 
f" (ignored)") return dict( input_ids=input_ids, labels=targets, ) def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict: # roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"} roles = {"human": "user", "gpt": "assistant"} # Add image tokens to tokenizer as a special tokens # Use a deepcopy of tokenizer so that we don't modify on the tokenizer tokenizer = copy.deepcopy(tokenizer) # When there is actually an image, we add the image tokens as a special token if has_image: tokenizer.add_tokens([""], special_tokens=True) image_token_index = tokenizer.convert_tokens_to_ids("") im_start, im_end = tokenizer.additional_special_tokens_ids[0:2] # for qwen2_5 # unmask_tokens = ["<|im_start|>", "<|im_start|>", "\n"] unmask_tokens_idx = [198, im_start, im_end] nl_tokens = tokenizer("\n").input_ids # Reset Qwen chat templates so that it won't include system message every time we apply chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}" tokenizer.chat_template = chat_template # _system = tokenizer("system").input_ids + nl_tokens # _user = tokenizer("user").input_ids + nl_tokens # _assistant = tokenizer("assistant").input_ids + nl_tokens # Apply prompt templates input_ids, targets = [], [] for i, source in enumerate(sources): if roles[source[0]["from"]] != roles["human"]: source = source[1:] input_id, target = [], [] # New version, use apply chat template # Build system message for each sentence input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}]) target += [IGNORE_INDEX] * len(input_id) for conv in source: # Make sure llava data can load try: role = conv["role"] content = conv["content"] except: role = conv["from"] content = conv["value"] role = roles.get(role, role) conv = [{"role" : role, "content" : content}] encode_id = tokenizer.apply_chat_template(conv) input_id += encode_id if role in ["user", "system"]: target += [IGNORE_INDEX] * len(encode_id) else: target += encode_id assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}" for idx, encode_id in enumerate(input_id): if encode_id in unmask_tokens_idx: target[idx] = encode_id if encode_id == image_token_index: input_id[idx] = IMAGE_TOKEN_INDEX input_ids.append(input_id) targets.append(target) input_ids = torch.tensor(input_ids, dtype=torch.long) targets = torch.tensor(targets, dtype=torch.long) del tokenizer return dict( input_ids=input_ids, # tensor(bs x seq_len) labels=targets, # tensor(bs x seq_len) ) def preprocess_internlm2(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict: # roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"} roles = {"human": "user", "gpt": "assistant"} # Add image tokens to tokenizer as a special tokens # Use a deepcopy of tokenizer so that we don't modify on the tokenizer tokenizer = copy.deepcopy(tokenizer) # When there is actually an image, we add the image tokens as a special token if has_image: tokenizer.add_tokens([""], special_tokens=True) image_token_index = tokenizer.convert_tokens_to_ids("") unmask_tokens = ["<|im_start|>", "<|im_end|>", "<|action_start|>", "<|action_end|>", "<|interpreter|>", "<|plugin|>"] unmask_tokens_idx = 
[tokenizer.convert_tokens_to_ids(tok) for tok in unmask_tokens] # nl_tokens = tokenizer("\n").input_ids # chat_template = "{{ bos_token }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}" # tokenizer.chat_template = chat_template # _system = tokenizer("system").input_ids + nl_tokens # _user = tokenizer("user").input_ids + nl_tokens # _assistant = tokenizer("assistant").input_ids + nl_tokens # Apply prompt templates input_ids, targets = [], [] for i, source in enumerate(sources): if roles[source[0]["from"]] != roles["human"]: source = source[1:] input_id, target = [], [] # New version, use apply chat template # Build system message for each sentence input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}]) target += [IGNORE_INDEX] * len(input_id) for conv in source: # Make sure llava data can load try: role = conv["role"] content = conv["content"] except: role = conv["from"] content = conv["value"] role = roles.get(role, role) conv = [{"role" : role, "content" : content}] encode_id = tokenizer.apply_chat_template(conv) input_id += encode_id if role in ["user", "system"]: target += [IGNORE_INDEX] * len(encode_id) else: target += encode_id assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}" for idx, encode_id in enumerate(input_id): if encode_id in unmask_tokens_idx: target[idx] = encode_id if encode_id == image_token_index: input_id[idx] = IMAGE_TOKEN_INDEX input_ids.append(input_id) targets.append(target) input_ids = torch.tensor(input_ids, dtype=torch.long) targets = torch.tensor(targets, dtype=torch.long) del tokenizer return dict( input_ids=input_ids, # tensor(bs x seq_len) labels=targets, # tensor(bs x seq_len) ) def preprocess_llama3( sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.", ) -> Dict: # roles = {"human": "<|start_header_id|>user<|end_header_id|>", "gpt": "<|start_header_id|>assistant<|end_header_id|>"} roles = {"human": "user", "gpt": "assistant"} # Add image tokens to tokenizer as a special tokens # Use a deepcopy of tokenizer so that we don't modify on the tokenizer tokenizer = copy.deepcopy(tokenizer) # When there is actually an image, we add the image tokens as a special token if has_image: tokenizer.add_tokens([""], special_tokens=True) image_token_index = tokenizer.convert_tokens_to_ids("") bos_token_id = tokenizer.convert_tokens_to_ids("<|begin_of_text|>") start_header_id = tokenizer.convert_tokens_to_ids("<|start_header_id|>") end_header_id = tokenizer.convert_tokens_to_ids("<|end_header_id|>") eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>") unmask_tokens = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "\n\n"] unmask_tokens_idx = [tokenizer.convert_tokens_to_ids(tok) for tok in unmask_tokens] # After update, calling tokenizer of llama3 will # auto add bos id for the tokens. 
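# --- Editor's note: illustrative sketch, not part of the original file. ---
# The point of safe_tokenizer_llama3() below: newer llama3 tokenizers prepend
# <|begin_of_text|> automatically, which would double-count BOS when
# concatenating per-round encodings. The stripping pattern in isolation
# (token ids are examples only):
#
#   def strip_leading_bos(input_ids, bos_token_id):
#       return input_ids[1:] if input_ids and input_ids[0] == bos_token_id else input_ids
#
#   # strip_leading_bos([128000, 9906], 128000) -> [9906]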
ヽ(`⌒´)ノ def safe_tokenizer_llama3(text): input_ids = tokenizer(text).input_ids if input_ids[0] == bos_token_id: input_ids = input_ids[1:] return input_ids nl_tokens = tokenizer.convert_tokens_to_ids("\n\n") # Apply prompt templates input_ids, targets = [], [] for i, source in enumerate(sources): if roles[source[0]["from"]] != roles["human"]: source = source[1:] input_id, target = [], [] # New version, use apply chat template # Build system message for each sentence input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}]) target += [IGNORE_INDEX] * len(input_id) for conv in source: # Make sure llava data can load try: role = conv["role"] content = conv["content"] except: role = conv["from"] content = conv["value"] role = roles.get(role, role) conv = [{"role" : role, "content" : content}] # First is bos token we don't need here encode_id = tokenizer.apply_chat_template(conv)[1:] input_id += encode_id if role in ["user", "system"]: target += [IGNORE_INDEX] * len(encode_id) else: target += encode_id assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}" for idx, encode_id in enumerate(input_id): if encode_id in unmask_tokens_idx: target[idx] = encode_id if encode_id == image_token_index: input_id[idx] = IMAGE_TOKEN_INDEX input_ids.append(input_id) targets.append(target) input_ids = torch.tensor(input_ids, dtype=torch.long) targets = torch.tensor(targets, dtype=torch.long) return dict( input_ids=input_ids, # tensor(bs x seq_len) labels=targets, # tensor(bs x seq_len) ) def preprocess_v1(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict: conv = conversation_lib.default_conversation.copy() roles = {"human": conv.roles[0], "gpt": conv.roles[1]} # Apply prompt templates conversations = [] for i, source in enumerate(sources): if roles[source[0]["from"]] != conv.roles[0]: # Skip the first one if it is not from human source = source[1:] conv.messages = [] for j, sentence in enumerate(source): role = roles[sentence["from"]] assert role == conv.roles[j % 2], f"{i}" conv.append_message(role, sentence["value"]) conversations.append(conv.get_prompt()) # Tokenize conversations if has_image: input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0) else: input_ids = tokenizer( conversations, return_tensors="pt", padding="longest", max_length=tokenizer.model_max_length, truncation=True, ).input_ids targets = input_ids.clone() assert conv.sep_style == conversation_lib.SeparatorStyle.TWO # Mask targets sep = conv.sep + conv.roles[1] + ": " for conversation, target in zip(conversations, targets): total_len = int(target.ne(tokenizer.pad_token_id).sum()) rounds = conversation.split(conv.sep2) cur_len = 1 target[:cur_len] = IGNORE_INDEX for i, rou in enumerate(rounds): if rou == "": break parts = rou.split(sep) if len(parts) != 2: break parts[0] += sep if has_image: round_len = len(tokenizer_image_token(rou, tokenizer)) instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2 else: round_len = len(tokenizer(rou).input_ids) instruction_len = len(tokenizer(parts[0]).input_ids) - 2 if i != 0 and not tokenizer.legacy and IS_TOKENIZER_GREATER_THAN_0_14: round_len -= 1 instruction_len -= 1 target[cur_len : cur_len + instruction_len] = IGNORE_INDEX cur_len += round_len target[cur_len:] = IGNORE_INDEX if cur_len < tokenizer.model_max_length: if cur_len != total_len: target[:] = IGNORE_INDEX print(f"WARNING: tokenization mismatch: {cur_len} vs. 
{total_len}." f" (ignored)") return dict( input_ids=input_ids, labels=targets, ) def preprocess_mpt(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict: conv = conversation_lib.default_conversation.copy() roles = {"human": conv.roles[0], "gpt": conv.roles[1]} # Apply prompt templates conversations = [] for i, source in enumerate(sources): if roles[source[0]["from"]] != conv.roles[0]: # Skip the first one if it is not from human source = source[1:] conv.messages = [] for j, sentence in enumerate(source): role = roles[sentence["from"]] assert role == conv.roles[j % 2], f"{i}" conv.append_message(role, sentence["value"]) conversations.append(conv.get_prompt()) # Tokenize conversations if has_image: input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0) else: input_ids = tokenizer( conversations, return_tensors="pt", padding="longest", max_length=tokenizer.model_max_length, truncation=True, ).input_ids targets = input_ids.clone() assert conv.sep_style == conversation_lib.SeparatorStyle.MPT # Mask targets sep = conv.sep + conv.roles[1] for conversation, target in zip(conversations, targets): total_len = int(target.ne(tokenizer.pad_token_id).sum()) rounds = conversation.split(conv.sep) re_rounds = [conv.sep.join(rounds[:3])] # system + user + gpt for conv_idx in range(3, len(rounds), 2): re_rounds.append(conv.sep.join(rounds[conv_idx : conv_idx + 2])) # user + gpt cur_len = 1 target[:cur_len] = IGNORE_INDEX for i, rou in enumerate(re_rounds): if rou == "": break parts = rou.split(sep) if len(parts) != 2: break parts[0] += sep if has_image: round_len = len(tokenizer_image_token(rou, tokenizer)) instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1 else: round_len = len(tokenizer(rou).input_ids) instruction_len = len(tokenizer(parts[0]).input_ids) - 1 if i != 0 and getattr(tokenizer, "legacy", False) and IS_TOKENIZER_GREATER_THAN_0_14: round_len += 1 instruction_len += 1 target[cur_len : cur_len + instruction_len] = IGNORE_INDEX cur_len += round_len target[cur_len:] = IGNORE_INDEX if cur_len < tokenizer.model_max_length: if cur_len != total_len: target[:] = IGNORE_INDEX print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f"(#turns={len(re_rounds)} ignored)") return dict( input_ids=input_ids, labels=targets, ) def preprocess_plain( sources: Sequence[str], tokenizer: transformers.PreTrainedTokenizer, ) -> Dict: # add end signal and concatenate together conversations = [] for source in sources: assert len(source) == 2 assert DEFAULT_IMAGE_TOKEN in source[0]["value"] source[0]["value"] = DEFAULT_IMAGE_TOKEN conversation = source[0]["value"] + source[1]["value"] + conversation_lib.default_conversation.sep conversations.append(conversation) # tokenize conversations input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations] targets = copy.deepcopy(input_ids) for target, source in zip(targets, sources): tokenized_len = len(tokenizer_image_token(source[0]["value"], tokenizer)) target[:tokenized_len] = IGNORE_INDEX return dict(input_ids=input_ids, labels=targets) def preprocess(sources: Sequence[str], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict: """ Given a list of sources, each is a conversation list. This transform: 1. Add signal '### ' at the beginning each sentence, with end signal '\n'; 2. Concatenate conversations together; 3. Tokenize the concatenated conversation; 4. 
Make a deepcopy as the target. Mask human words with IGNORE_INDEX. """ if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.PLAIN: return preprocess_plain(sources, tokenizer) if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.LLAMA_2: return preprocess_llama_2(sources, tokenizer, has_image=has_image) if conversation_lib.default_conversation.version.startswith("v1"): return preprocess_v1(sources, tokenizer, has_image=has_image) if conversation_lib.default_conversation.version == "mpt": return preprocess_mpt(sources, tokenizer, has_image=has_image) if conversation_lib.default_conversation.version == "qwen": return preprocess_qwen(sources, tokenizer, has_image=has_image) if conversation_lib.default_conversation.version == "internlm_2": return preprocess_internlm2(sources, tokenizer, has_image=has_image) if conversation_lib.default_conversation.version == "gemma": return preprocess_gemma(sources, tokenizer, has_image=has_image) if conversation_lib.default_conversation.version == "llama_v3": return preprocess_llama3(sources, tokenizer, has_image=has_image) # add end signal and concatenate together conversations = [] for source in sources: header = f"{conversation_lib.default_conversation.system}\n\n" conversation = _add_speaker_and_signal(header, source) conversations.append(conversation) # tokenize conversations def get_tokenize_len(prompts): return [len(tokenizer_image_token(prompt, tokenizer)) for prompt in prompts] if has_image: input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations] else: conversations_tokenized = _tokenize_fn(conversations, tokenizer) input_ids = conversations_tokenized["input_ids"] targets = copy.deepcopy(input_ids) for target, source in zip(targets, sources): if has_image: tokenized_lens = get_tokenize_len([header] + [s["value"] for s in source]) else: tokenized_lens = _tokenize_fn([header] + [s["value"] for s in source], tokenizer)["input_ids_lens"] speakers = [sentence["from"] for sentence in source] _mask_targets(target, tokenized_lens, speakers) return dict(input_ids=input_ids, labels=targets) class LazySupervisedDataset(Dataset): def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer, data_args: DataArguments): super(LazySupervisedDataset, self).__init__() self.tokenizer = tokenizer self.num_video_tokens = max(8, data_args.frames_upbound) * 128 // 8 # 估计一下video token 数量 try: from petrel_client.client import Client has_client = True except ImportError: has_client = False if has_client: self.client = Client('~/petreloss.conf') else: self.client = None # if get_local_rank() == 0: list_data_dict = [] # Handle multiple JSON files specified in the data_path if "{" in data_path and "}" in data_path: raise NotImplementedError("Please use .yaml!!!") base_path, file_pattern = re.match(r"^(.*)\{(.*)\}\.json$", data_path).groups() file_names = file_pattern.split(",") rank0_print(f"Loading {file_names} from {base_path}") data_args.dataset_paths = [] for file_name in file_names: data_args.dataset_paths.append(f"{base_path}{file_name}.json") full_path = f"{base_path}{file_name}.json" rank0_print(f"Loading {full_path}") with open(full_path, "r") as file: cur_data_dict = json.load(file) rank0_print(f"Loaded {len(cur_data_dict)} samples from {full_path}") list_data_dict.extend(cur_data_dict) elif data_path.endswith(".yaml"): with open(data_path, "r") as file: yaml_data = yaml.safe_load(file) datasets = yaml_data.get("datasets") # file should be 
in the format of: # datasets: # - json_path: xxxx1.json # sampling_strategy: first:1000 # - json_path: xxxx2.json # sampling_strategy: end:3000 # - json_path: xxxx3.json # sampling_strategy: random:999 # data_args.dataset_paths = [dataset.get("json_path") for dataset in datasets] # NOTE for dataset in datasets: json_path = dataset.get("json_path") sampling_strategy = dataset.get("sampling_strategy", "all") sampling_number = None rank0_print(f"Loading {json_path} with {sampling_strategy} sampling strategy") if json_path.endswith(".jsonl"): cur_data_dict = [] if "s3://" in json_path: with io.BytesIO(self.client.get(json_path)) as json_file: for line in json_file: cur_data_dict.append(json.loads(line.strip())) else: with open(json_path, "r") as json_file: for line in json_file: cur_data_dict.append(json.loads(line.strip())) elif json_path.endswith(".json"): if "s3://" in json_path: with io.BytesIO(self.client.get(json_path)) as json_file: cur_data_dict = json.load(json_file) else: with open(json_path, "r") as json_file: cur_data_dict = json.load(json_file) else: raise ValueError(f"Unsupported file type: {json_path}") assert len(cur_data_dict) > 0, cur_data_dict media_type = dataset.get("media_type", None) if media_type is None: if 'image' in cur_data_dict[0].keys(): # NOTE 碰到混合数据可能会出错 media_type = 'image' elif 'video' in cur_data_dict[0].keys(): media_type = 'video' else: media_type = 'text' if ":" in sampling_strategy: sampling_strategy, sampling_number = sampling_strategy.split(":") if "%" in sampling_number: sampling_number = math.ceil(int(sampling_number.split("%")[0]) * len(cur_data_dict) / 100) else: sampling_number = int(sampling_number) # Apply the sampling strategy if sampling_strategy == "first" and sampling_number is not None: cur_data_dict = cur_data_dict[:sampling_number] rank0_print(f"sampling_strategy={sampling_strategy}, {0}:{sampling_number}") elif sampling_strategy == "first2" and sampling_number is not None: cur_data_dict = cur_data_dict[sampling_number:sampling_number*2] rank0_print(f"sampling_strategy={sampling_strategy}, {sampling_number}:{sampling_number*2}") elif sampling_strategy == "first3" and sampling_number is not None: cur_data_dict = cur_data_dict[sampling_number*2:sampling_number*3] rank0_print(f"sampling_strategy={sampling_strategy}, {sampling_number*2}:{sampling_number*3}") elif sampling_strategy == "first4" and sampling_number is not None: cur_data_dict = cur_data_dict[sampling_number*3:sampling_number*4] rank0_print(f"sampling_strategy={sampling_strategy}, {sampling_number*3}:{sampling_number*4}") elif sampling_strategy == "end" and sampling_number is not None: cur_data_dict = cur_data_dict[-sampling_number:] rank0_print(f"sampling_strategy={sampling_strategy}, {-sampling_number}:-") elif sampling_strategy == "random" and sampling_number is not None: raise NotImplementedError("Don't use random") random.shuffle(cur_data_dict) cur_data_dict = cur_data_dict[:sampling_number] video_read_type = dataset.get("video_read_type", None) data_root = dataset.get("data_root", "") # try: # post-process meta info if media_type not in ['text', 'mix']: def check_pnorm2(ori_path): # TODO ugly code, remove it after clean anno file if ori_path.startswith("pnorm2:s3://") or ori_path.startswith("p2:s3://") or ori_path.startswith("pssd:s3://"): old_bucket_name = ori_path.split('://')[1].split('/')[0] data_prefix = ori_path.split('://')[0] data_path = '/'.join(ori_path.split('://')[1].split('/')[1:]) # new_bucket_name = old_bucket_name.replace('-', '_').lower() new_bucket_name = 
old_bucket_name.lower() return data_prefix + '://' + new_bucket_name + '/' + data_path else: return ori_path for i in range(len(cur_data_dict)): if video_read_type != None: cur_data_dict[i]['video_read_type'] = video_read_type if type(cur_data_dict[i][media_type]) is list: new_data_path = [] for old_data_path in cur_data_dict[i][media_type]: new_data_path.append(os.path.join(data_root, old_data_path)) # new_data_path.append(check_pnorm2(os.path.join(data_root, old_data_path))) cur_data_dict[i][media_type] = new_data_path else: cur_data_dict[i][media_type] = os.path.join(data_root, cur_data_dict[i][media_type]) # cur_data_dict[i][media_type] = check_pnorm2(os.path.join(data_root, cur_data_dict[i][media_type])) rank0_print(f"Check samples from {json_path}, media_type={media_type}, video_read_type={video_read_type}, data_root={data_root}") if media_type not in ['text', 'mix'] and video_read_type != 'fake': ok = False for i in range(3): data_path = cur_data_dict[i][media_type] if type(data_path) is list: data_path = data_path[0] rank0_print(f"Checking: {data_path}") if 's3://' in data_path: if media_type == 'video' and video_read_type in ['img', 'frame']: for path in self.client.list(data_path): ok = True break else: tmp_data = self.client.get(data_path) if tmp_data is not None and len(tmp_data) > 0: ok = True else: if os.path.exists(data_path): ok = True if ok: break assert ok, f"Data in {data_path} can't be read!" rank0_print(f"Loaded {len(cur_data_dict)} samples from {json_path}, media_type={media_type}, video_read_type={video_read_type}, data_root={data_root}") # except Exception as e: # rank0_print(f"Loaded {len(cur_data_dict)} samples from {json_path}, data_root={data_root}, something maybe wrong {e}!!!") list_data_dict.extend(cur_data_dict) else: raise NotImplementedError("Please use .yaml!!!") data_args.dataset_paths = [data_path] rank0_print(f"Loading {data_path}") with open(data_path, "r") as file: cur_data_dict = json.load(file) rank0_print(f"Loaded {len(cur_data_dict)} samples from {data_path}") list_data_dict.extend(cur_data_dict) # else: # list_data_dict = [] self.list_data_dict = list_data_dict # self.list_data_dict = TorchShmSerializedList(list_data_dict) rank0_print(f"Loaded {len(self.list_data_dict)} samples from {data_path}") rank0_print("Formatting inputs...Skip in lazy mode") self.tokenizer = tokenizer self.data_args = data_args def __len__(self): return len(self.list_data_dict) @property def lengths(self): length_list = [] for sample in self.list_data_dict: if "image" in sample: img_tokens = 128 elif "video" in sample: img_tokens = self.num_video_tokens else: img_tokens = 0 length_list.append(sum(len(conv["value"].split()) for conv in sample["conversations"]) + img_tokens) return length_list @property def modality_lengths(self): length_list = [] for sample in self.list_data_dict: cur_len = sum(len(conv["value"].split()) for conv in sample["conversations"]) assert cur_len > 0, f"Conversation length is 0 for {sample}" if "image" in sample or "video" in sample or self.data_args.early_mix_text: length_list.append(cur_len) else: length_list.append(-cur_len) return length_list def process_image(self, image_file, overwrite_image_aspect_ratio=None): # image_folder = self.data_args.image_folder # start_time = time.time() processor = self.data_args.image_processor # print(f"\n\nInspecting the image path, image_file={image_file}") try: if 's3://' in image_file: value = self.client.Get(image_file) img_bytes = np.frombuffer(value, dtype=np.uint8) with io.BytesIO(img_bytes) as buff: 
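# --- Editor's note: illustrative sketch, not part of the original file. ---
# The yaml loader above accepts sampling_strategy values such as "all",
# "first:1000", "end:3000", or "first:50%" and resolves them against the loaded
# annotation list. The parsing rule in isolation (function name hypothetical):
#
#   import math
#   def resolve_sampling(strategy: str, n_total: int):
#       if ":" not in strategy:
#           return strategy, None                  # e.g. "all"
#       name, num = strategy.split(":")
#       if "%" in num:
#           return name, math.ceil(int(num.rstrip("%")) * n_total / 100)
#       return name, int(num)
#
#   # resolve_sampling("first:50%", 2000) -> ("first", 1000)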
image = Image.open(buff).convert('RGB') else: image = Image.open(image_file).convert('RGB') # PIL Image except Exception as exn: print(f"Failed to open image {image_file}. Exception:", exn) raise exn image_size = image.size image_aspect_ratio = self.data_args.image_aspect_ratio if overwrite_image_aspect_ratio is not None: image_aspect_ratio = overwrite_image_aspect_ratio if image_aspect_ratio == "highres": raise NotImplementedError image = process_highres_image(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints) # elif image_aspect_ratio == "anyres" or "anyres_max" in image_aspect_ratio: elif "anyres" in image_aspect_ratio: if 'nopad' in image_aspect_ratio: image = process_anyres_image_nopad(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints) else: raise NotImplementedError image = process_anyres_image(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints) elif image_aspect_ratio == "crop_split": raise NotImplementedError image = process_highres_image_crop_split(image, self.data_args) elif image_aspect_ratio == "pad": def expand2square(pil_img, background_color): width, height = pil_img.size if width == height: return pil_img elif width > height: result = Image.new(pil_img.mode, (width, width), background_color) result.paste(pil_img, (0, (width - height) // 2)) return result else: result = Image.new(pil_img.mode, (height, height), background_color) result.paste(pil_img, ((height - width) // 2, 0)) return result image = expand2square(image, tuple(int(x * 255) for x in processor.image_mean)) image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0] else: image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0] # end_time = time.time() # print(image_file, end_time - start_time) # print(f"OK, image_file={image_file}\n\n") return image, image_size, "image" def process_video(self, video_file, data_anno, data_args): # print(f"\n\nInspecting the video path, video_file={video_file}\n\n", flush=True) # logging.info(f"\n\nInspecting the video path, video_file={video_file}\n\n") # start_time = time.time() local_num_frames = data_args.local_num_frames max_num_frames = data_args.frames_upbound min_num_frames = data_args.frames_lowbound sample_type = data_args.sample_type video_reader_type = data_anno.get("video_read_type", "decord") if "start" in data_anno and "end" in data_anno: clip = [float(data_anno["start"]), float(data_anno["end"])] else: clip = None if clip is None or video_reader_type == "img": video_reader = VIDEO_READER_FUNCS[video_reader_type] frames, frame_indices, fps, duration = video_reader( video_file, max_num_frames, sample_type, min_num_frames=min_num_frames, max_num_frames=max_num_frames, client=self.client, clip=clip, local_num_frames=local_num_frames ) # if sample_type in ['rand', 'middle'] and len(frames) < local_num_frames and len(frames) != max_num_frames: # raise ValueError(f"{video_file} only have {len(frames)} frames!!!") # logger.info(f"{data_path} is OK!!!!") else: video_reader = VIDEO_READER_FUNCS['lazy'] start, end = clip duration = end - start if min_num_frames > duration: min_num_frames = (duration // local_num_frames) * local_num_frames if sample_type == 'dynamic_fps1': num_segments = int(duration // local_num_frames) if num_segments == 0: num_frames = local_num_frames else: num_frames = local_num_frames * num_segments num_frames = min(num_frames, max_num_frames) num_frames = max(num_frames, min_num_frames) else: num_frames = max_num_frames frames, frame_indices, 
fps = video_reader(video_file, num_frames=num_frames, video_start=start, video_end=end, client=self.client) # logger.info(f"{data_path} is OK, duation={end-start} num_frames={num_frames}!!!!") if sample_type == 'dynamic_fps1' and len(frames) % local_num_frames != 0: raise ValueError(f"min_num_frames={min_num_frames}, max_num_frames={max_num_frames}, local_num_frames={local_num_frames}, len(frames)={len(frames)}, is wrong!!!") sec = [str(round(f / fps, 1)) for f in frame_indices] if data_args.time_msg is not None and sec is not None: if data_args.time_msg == 'short': msg = f"\nThe video lasts for {duration:.2f} seconds, and {len(sec)} frames are uniformly sampled from it. " else: # " " should be added in the start and end msg = f"\nThe video lasts for {duration:.2f} seconds, and {len(sec)} frames are uniformly sampled at {', '.join(sec)} seconds. " else: msg = "" # logging.info(f"OK, video_file={video_file}\n\n") # print(f"OK, video_file={video_file}\n\n", flush=True) # end_time = time.time() # print(video_file, end_time - start_time) return frames, msg def __getitem__(self, i) -> Dict[str, torch.Tensor]: # TODO: define number of retries somewhere else num_base_retries = 2 num_final_retries = 300 # try the current sample first for attempt_idx in range(num_base_retries): try: sample = self._get_item(i) return sample except Exception as e: # sleep 1s in case it is a cloud disk issue print(f"[Try #{attempt_idx}] Failed to fetch sample {i}. Exception:", e) if attempt_idx != (num_base_retries -1): time.sleep(1) retry_step = 5 # try other samples, in case it is file corruption issue for attempt_idx in range(num_base_retries+3): try: next_index = min(i + retry_step, len(self.list_data_dict) - 1) # sample_idx = random.choice(range(len(self))) sample = self._get_item(next_index) return sample except Exception as e: # no need to sleep print(f"[Try other #{attempt_idx}] Failed to fetch sample {next_index}. 
Exception:", e) retry_step *= 2 try: sample = self._get_item(i) return sample except Exception as e: raise e def _get_item(self, i) -> Dict[str, torch.Tensor]: sources = self.list_data_dict[i] if isinstance(i, int): sources = [sources] else: raise NotImplementedError(i) assert len(sources) == 1, "Don't know why it is wrapped to a list" # FIXME if "image" in sources[0]: image_file = self.list_data_dict[i]["image"] if type(image_file) is list: # Handling multi images # overwrite to process with simple pad if len(image_file) > 1: image = [self.process_image(f, "pad") for f in image_file] image = [[im[0], im[1], "image"] for im in image] else: image = [self.process_image(f) for f in image_file] else: image = [self.process_image(image_file)] # sources = preprocess_multimodal(copy.deepcopy([e["conversations"] for e in sources]), self.data_args) sources = preprocess_multimodal([e["conversations"] for e in sources], self.data_args) elif "video" in sources[0]: video_file = self.list_data_dict[i]["video"] try: video, time_msg = self.process_video(video_file, data_anno=self.list_data_dict[i], data_args=self.data_args) # print(video_file, time_msg) processor = self.data_args.image_processor frame_aspect_ratio = self.data_args.frame_aspect_ratio # if frame_aspect_ratio == "anyres" or "anyres_max" in frame_aspect_ratio: if "anyres" in frame_aspect_ratio: if 'nopad' in frame_aspect_ratio: image = process_anyres_video_nopad(video, self.data_args.image_processor, self.data_args.frame_grid_pinpoints, max_resolutions=self.data_args.max_num_pixels // len(video)) else: raise NotImplementedError # image = process_anyres_video(video, self.data_args.image_processor, self.data_args.frame_grid_pinpoints) else: image = processor.preprocess(video, return_tensors="pt")["pixel_values"] image = [(image, video[0].shape[0:2], "video")] # sources = preprocess_multimodal(copy.deepcopy([e["conversations"] for e in sources]), self.data_args, msg=time_msg) sources = preprocess_multimodal([e["conversations"] for e in sources], self.data_args, msg=time_msg) except Exception as e: print(f"Error: {e}") print(f"Failed to read video file: {video_file}") raise e else: # sources = copy.deepcopy([e["conversations"] for e in sources]) # NOTE epoch>1时会出问题,最好提前处理了 sources = [e["conversations"] for e in sources] has_image = ("image" in self.list_data_dict[i]) or ("video" in self.list_data_dict[i]) data_dict = preprocess(sources, self.tokenizer, has_image=has_image) if "prompt" in data_dict: prompt = data_dict["prompt"] else: prompt = None if isinstance(i, int): data_dict = dict(input_ids=data_dict["input_ids"][0], labels=data_dict["labels"][0]) # image exist in the data if "image" in self.list_data_dict[i]: data_dict["image"] = image elif "video" in self.list_data_dict[i]: data_dict["image"] = image elif self.data_args.is_multimodal: # image does not exist in the data, but the model is multimodal crop_size = self.data_args.image_processor.crop_size data_dict["image"] = [ (torch.zeros(1, 3, crop_size["height"], crop_size["width"]), (crop_size["width"], crop_size["height"]), "text"), ] # prompt exist in the data if prompt is not None: data_dict["prompt"] = prompt data_dict["id"] = self.list_data_dict[i].get("id", i) # gc.collect() # NOTE return data_dict @dataclass class DataCollatorForSupervisedDataset(object): """Collate examples for supervised fine-tuning.""" tokenizer: transformers.PreTrainedTokenizer def pad_sequence(self, input_ids, batch_first, padding_value): if self.tokenizer.padding_side == "left": input_ids = 
[torch.flip(_input_ids, [0]) for _input_ids in input_ids] input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=batch_first, padding_value=padding_value) if self.tokenizer.padding_side == "left": input_ids = torch.flip(input_ids, [1]) return input_ids def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]: input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels")) # input_ids, labels, ids = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels", "id")) input_ids = [_input_ids[: self.tokenizer.model_max_length] for _input_ids in input_ids] labels = [_labels[: self.tokenizer.model_max_length] for _labels in labels] if self.tokenizer.pad_token_id is None: # self.tokenizer.pad_token_id = self.tokenizer.eos_token_id # FIXME: this could only be triggered for llama3 model. self.tokenizer.pad_token_id = 0 # This gets the best result. Don't know why. input_ids = self.pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id) labels = self.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX) batch = dict(input_ids=input_ids, labels=labels.long() if labels.dtype == torch.int32 else labels, attention_mask=input_ids.ne(self.tokenizer.pad_token_id)) # batch = dict(input_ids=input_ids, labels=labels, attention_mask=input_ids.ne(self.tokenizer.pad_token_id), ids=ids) if "image" in instances[0]: images = [instance["image"] for instance in instances] # data_format: [image/video, spatial_size, media_type] batch["image_sizes"] = [im[1] for im_list in images for im in im_list] batch["modalities"] = [im[2] for im_list in images for im in im_list] images = [im[0] for im_list in images for im in im_list] # flatten multi-images # 拉平多图应该没有影响,只要后面顺序对的上就行 # use list for input of different lengths # if all(x is not None and x.shape == images[0].shape for x in images): # Image: (N, P, C, H, W) # Video: (N, F, C, H, W) # batch["images"] = torch.stack(images) # else: batch["images"] = images else: # 纯文本数据也会填一个images raise NotImplementedError(instances[0]) if "prompt" in instances[0]: batch["prompts"] = [instance["prompt"] for instance in instances] return batch def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer, data_args) -> Dict: """Make dataset and collator for supervised fine-tuning.""" train_dataset = LazySupervisedDataset(tokenizer=tokenizer, data_path=data_args.data_path, data_args=data_args) data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer) return dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator) def get_model(model_args, training_args, bnb_model_from_pretrained_args): assert training_args.attn_implementation if training_args.attn_implementation == "sdpa" and torch.__version__ < "2.1.2": raise ValueError("The 'sdpa' attention implementation requires torch version 2.1.2 or higher.") customized_kwargs = dict() customized_kwargs.update(bnb_model_from_pretrained_args) cfg_pretrained = None if ',' in model_args.llm_compress_layer_list: llm_compress_layer_list = [int(i) for i in model_args.llm_compress_layer_list.split(',')] else: llm_compress_layer_list = [int(model_args.llm_compress_layer_list)] llm_image_token_ratio_list = [float(i) for i in model_args.llm_image_token_ratio_list.split(',')] overwrite_config = {"vision_encode_type": model_args.vision_encode_type, "mm_num_compress_latents": model_args.mm_num_compress_latents, "mm_num_compress_query_type": model_args.mm_num_compress_query_type, 
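# --- Editor's note: illustrative sketch, not part of the original file. ---
# DataCollatorForSupervisedDataset.pad_sequence() above implements left padding
# by flipping each sequence, right-padding with pad_sequence, then flipping the
# batch back, since torch.nn.utils.rnn.pad_sequence only pads on the right:
#
#   import torch
#   seqs = [torch.tensor([1, 2, 3]), torch.tensor([4])]
#   flipped = [s.flip(0) for s in seqs]
#   padded = torch.nn.utils.rnn.pad_sequence(flipped, batch_first=True, padding_value=0)
#   left_padded = padded.flip(1)   # tensor([[1, 2, 3], [0, 0, 4]])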
"mm_pos_num_frames": model_args.mm_pos_num_frames, "mm_local_num_frames": model_args.mm_local_num_frames, "mm_close_init": model_args.mm_close_init, "min_slow_num_frames": model_args.min_slow_num_frames, "mm_llm_compress": model_args.mm_llm_compress, "llm_compress_layer_list": llm_compress_layer_list, "llm_image_token_ratio_list": llm_image_token_ratio_list, "llm_compress_type": model_args.llm_compress_type, "mm_projector_type": model_args.mm_projector_type, "mm_patch_merge_type": model_args.mm_patch_merge_type, "mm_newline_position": model_args.mm_newline_position } if any( [ model_args.rope_scaling_factor is not None, model_args.rope_scaling_type is not None, model_args.mm_spatial_pool_stride is not None, model_args.mm_spatial_pool_out_channels is not None, model_args.mm_spatial_pool_mode is not None, model_args.mm_resampler_type is not None, ] ): if "internlm2" in model_args.model_name_or_path.lower(): cfg_pretrained = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True) else: cfg_pretrained = AutoConfig.from_pretrained(model_args.model_name_or_path) else: raise NotImplementedError(model_args) if model_args.use_pos_skipping is not None and model_args.pos_skipping_range is not None: overwrite_config["use_pos_skipping"] = model_args.use_pos_skipping overwrite_config["pos_skipping_range"] = model_args.pos_skipping_range if model_args.rope_scaling_factor is not None and model_args.rope_scaling_type is not None: overwrite_config["rope_scaling"] = { "factor": model_args.rope_scaling_factor, "type": model_args.rope_scaling_type, } if training_args.model_max_length is None: training_args.model_max_length = cfg_pretrained.max_position_embeddings * model_args.rope_scaling_factor overwrite_config["max_sequence_length"] = training_args.model_max_length assert training_args.model_max_length == int(cfg_pretrained.max_position_embeddings * model_args.rope_scaling_factor), print( f"model_max_length: {training_args.model_max_length}, max_position_embeddings: {cfg_pretrained.max_position_embeddings}, rope_scaling_factor: {model_args.rope_scaling_factor}" ) # overwrite_config["max_sequence_length"] = model_args.max_sequence_length # overwrite_config["tokenizer_model_max_length"] = model_args.tokenizer_model_max_length if model_args.mm_spatial_pool_stride is not None and model_args.mm_spatial_pool_out_channels is not None and model_args.mm_spatial_pool_mode is not None and model_args.mm_resampler_type is not None: overwrite_config["mm_resampler_type"] = model_args.mm_resampler_type overwrite_config["mm_spatial_pool_stride"] = model_args.mm_spatial_pool_stride overwrite_config["mm_spatial_pool_out_channels"] = model_args.mm_spatial_pool_out_channels overwrite_config["mm_spatial_pool_mode"] = model_args.mm_spatial_pool_mode if model_args.mm_spatial_pool_mode is not None: overwrite_config["mm_spatial_pool_mode"] = model_args.mm_spatial_pool_mode if overwrite_config: assert cfg_pretrained is not None, "cfg_pretrained is None" rank0_print(f"Overwriting config with {overwrite_config}") for k, v in overwrite_config.items(): setattr(cfg_pretrained, k, v) customized_kwargs["config"] = cfg_pretrained if model_args.model_class_name is not None: raise NotImplementedError(model_args) actual_model_class_name = f"{model_args.model_class_name}ForCausalLM" model_class = getattr(transformers, actual_model_class_name) rank0_print(f"Using model class {model_class} from {model_args.model_class_name}") model = model_class.from_pretrained( model_args.model_name_or_path, 
cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), low_cpu_mem_usage=False, **customized_kwargs, ) elif model_args.vision_tower is not None: if "mixtral" in model_args.model_name_or_path.lower(): raise ValueError(f"I don't want model class {model_args}") model = LlavaMixtralForCausalLM.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), low_cpu_mem_usage=False, **customized_kwargs, ) from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock]) elif "mistral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower(): model = LlavaMistralForCausalLM.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), low_cpu_mem_usage=False, **customized_kwargs, ) elif ( "wizardlm-2" in model_args.model_name_or_path.lower() or "vicuna" in model_args.model_name_or_path.lower() or "llama" in model_args.model_name_or_path.lower() # or "yi" in model_args.model_name_or_path.lower() # or "nous-hermes" in model_args.model_name_or_path.lower() # and "wizard-2" in model_args.model_name_or_path.lower() ): raise ValueError(f"I don't want model class {model_args}") model = LlavaLlamaForCausalLM.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), low_cpu_mem_usage=False, **customized_kwargs, ) elif "qwen" in model_args.model_name_or_path.lower(): if "moe" in model_args.model_name_or_path.lower() or "A14B" in model_args.model_name_or_path: model = LlavaQwenMoeForCausalLM.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), low_cpu_mem_usage=False, **customized_kwargs, ) from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock deepspeed.utils.set_z3_leaf_modules(model, [Qwen2MoeSparseMoeBlock]) elif overwrite_config['mm_llm_compress']: model = LlavaQwenForCausalLM_Pdrop.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), low_cpu_mem_usage=False, **customized_kwargs, ) else: model = LlavaQwenForCausalLM.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), low_cpu_mem_usage=False, **customized_kwargs, ) elif "internlm2" in model_args.model_name_or_path.lower(): if overwrite_config['mm_llm_compress']: raise NotImplementedError # model = LlavaInternLM2ForCausalLM_Pdrop.from_pretrained( # model_args.model_name_or_path, # cache_dir=training_args.cache_dir, # attn_implementation=training_args.attn_implementation, # torch_dtype=(torch.bfloat16 if training_args.bf16 else None), # low_cpu_mem_usage=False, # **customized_kwargs, # ) else: model = LlavaInternLM2ForCausalLM.from_pretrained( model_args.model_name_or_path, 
cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), low_cpu_mem_usage=False, **customized_kwargs, ) elif "gemma" in model_args.model_name_or_path.lower(): raise ValueError(f"I don't want model class {model_args}") model = LlavaGemmaForCausalLM.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), low_cpu_mem_usage=False, **customized_kwargs, ) else: raise ValueError(f"Unknown model class {model_args}") else: raise NotImplementedError model = transformers.LlamaForCausalLM.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), low_cpu_mem_usage=False, **customized_kwargs, ) rank0_print(f"Model config: {model.config}.") return model def train(attn_implementation=None): # global local_rank parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments)) model_args, data_args, training_args = parser.parse_args_into_dataclasses() # wandb.init(project="mllm", entity="likunchang", name=os.path.basename(training_args.output_dir), reinit=True) # local_broadcast_process_authkey() # NOTE if training_args.verbose_logging: rank0_print(f"Inspecting experiment hyperparameters:\n") rank0_print(f"model_args = {vars(model_args)}\n\n") rank0_print(f"data_args = {vars(data_args)}\n\n") rank0_print(f"training_args = {vars(training_args)}\n\n") # rank0_print(f"evaluation_args = {vars(evaluation_args)}\n\n") # local_rank = training_args.local_rank compute_dtype = torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32) bnb_model_from_pretrained_args = {} if training_args.bits in [4, 8]: from transformers import BitsAndBytesConfig bnb_model_from_pretrained_args.update( dict( device_map={"": training_args.device}, load_in_4bit=training_args.bits == 4, load_in_8bit=training_args.bits == 8, quantization_config=BitsAndBytesConfig( load_in_4bit=training_args.bits == 4, load_in_8bit=training_args.bits == 8, llm_int8_threshold=6.0, llm_int8_has_fp16_weight=False, bnb_4bit_compute_dtype=compute_dtype, bnb_4bit_use_double_quant=training_args.double_quant, bnb_4bit_quant_type=training_args.quant_type, # {'fp4', 'nf4'} ), ) ) model = get_model(model_args, training_args, bnb_model_from_pretrained_args) model.config.use_cache = False if model_args.rope_scaling_factor is not None and model_args.rope_scaling_type is not None: model.config.rope_scaling = { "factor": model_args.rope_scaling_factor, "type": model_args.rope_scaling_type, } if model_args.freeze_backbone: model.model.requires_grad_(False) if training_args.bits in [4, 8]: from peft import prepare_model_for_kbit_training model.config.torch_dtype = torch.float32 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32) model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing) if training_args.gradient_checkpointing: if hasattr(model, "enable_input_require_grads"): model.enable_input_require_grads() else: def make_inputs_require_grad(module, input, output): output.requires_grad_(True) model.get_input_embeddings().register_forward_hook(make_inputs_require_grad) if training_args.lora_enable: from peft import LoraConfig, get_peft_model 
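# (Editor's note, not in the original source: find_all_linear_names below is
# assumed to collect the names of the model's nn.Linear modules so that LoRA
# adapters are attached to every linear projection; get_peft_model then freezes
# all non-adapter weights by default, so only the low-rank A/B matrices train.)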
lora_config = LoraConfig( r=training_args.lora_r, lora_alpha=training_args.lora_alpha, target_modules=find_all_linear_names(model), lora_dropout=training_args.lora_dropout, bias=training_args.lora_bias, task_type="CAUSAL_LM", ) if training_args.bits == 16: if training_args.bf16: model.to(torch.bfloat16) if training_args.fp16: model.to(torch.float16) rank0_print("Adding LoRA adapters...") model = get_peft_model(model, lora_config) if "mistral" in model_args.model_name_or_path.lower() or "mixtral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower(): tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="left") elif "qwen" in model_args.model_name_or_path.lower(): tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right") elif "internlm2" in model_args.model_name_or_path.lower(): tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right", trust_remote_code=True) elif ( "wizardlm-2" in model_args.model_name_or_path.lower() or "vicuna" in model_args.model_name_or_path.lower() or "llama" in model_args.model_name_or_path.lower() or "yi" in model_args.model_name_or_path.lower() or "nous-hermes" in model_args.model_name_or_path.lower() and "wizard-2" in model_args.model_name_or_path.lower() ): tokenizer = transformers.AutoTokenizer.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right", use_fast=False, ) rank0_print(f"Prompt version: {model_args.version}") if model_args.version == "v0": if tokenizer.pad_token is None: smart_tokenizer_and_embedding_resize( special_tokens_dict=dict(pad_token="[PAD]"), tokenizer=tokenizer, model=model, ) elif model_args.version == "v0.5": tokenizer.pad_token = tokenizer.unk_token else: if tokenizer.unk_token is not None: tokenizer.pad_token = tokenizer.unk_token if model_args.version in conversation_lib.conv_templates: conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version] else: raise NotImplementedError(f"Can't find your conv_templates: {model_args.version}") conversation_lib.default_conversation = conversation_lib.conv_templates["vicuna_v1"] if model_args.vision_tower is not None: model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp) vision_tower = model.get_vision_tower() vision_tower.to(dtype=torch.bfloat16 if training_args.bf16 else torch.float16, device=training_args.device) # NOTE hard code data_args.image_processor = vision_tower.image_processor data_args.is_multimodal = True model.config.image_aspect_ratio = data_args.image_aspect_ratio model.config.frame_aspect_ratio = data_args.frame_aspect_ratio if data_args.image_grid_pinpoints is not None: if isinstance(data_args.image_grid_pinpoints, str) and "x" in data_args.image_grid_pinpoints: try: patch_size = data_args.image_processor.size[0] except Exception as e: patch_size = data_args.image_processor.size["shortest_edge"] assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]" # Use regex to extract the range from the input string matches = re.findall(r"\((\d+)x(\d+)\)", 
data_args.image_grid_pinpoints) range_start = tuple(map(int, matches[0])) range_end = tuple(map(int, matches[-1])) # Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1]) grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)] # Multiply all elements by patch_size data_args.image_grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints] elif isinstance(data_args.image_grid_pinpoints, str): data_args.image_grid_pinpoints = ast.literal_eval(data_args.image_grid_pinpoints) if data_args.frame_grid_pinpoints is not None: if isinstance(data_args.frame_grid_pinpoints, str) and "x" in data_args.frame_grid_pinpoints: try: patch_size = data_args.image_processor.size[0] except Exception as e: patch_size = data_args.image_processor.size["shortest_edge"] assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]" # Use regex to extract the range from the input string matches = re.findall(r"\((\d+)x(\d+)\)", data_args.frame_grid_pinpoints) range_start = tuple(map(int, matches[0])) range_end = tuple(map(int, matches[-1])) # Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1]) grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)] # Multiply all elements by patch_size data_args.frame_grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints] elif isinstance(data_args.frame_grid_pinpoints, str): data_args.frame_grid_pinpoints = ast.literal_eval(data_args.frame_grid_pinpoints) model.config.max_num_pixels = data_args.max_num_pixels model.config.frame_grid_pinpoints = data_args.frame_grid_pinpoints model.config.image_grid_pinpoints = data_args.image_grid_pinpoints model.config.image_crop_resolution = data_args.image_crop_resolution model.config.image_split_resolution = data_args.image_split_resolution model.config.tokenizer_padding_side = tokenizer.padding_side model.config.tokenizer_model_max_length = tokenizer.model_max_length model.config.mm_newline_position = model_args.mm_newline_position ### Decide which parts of the model to train if model_args.mm_tunable_parts is None: # traditional way of deciding which parts to train model.config.tune_mm_mlp_adapter = training_args.tune_mm_mlp_adapter = model_args.tune_mm_mlp_adapter model.config.tune_mm_vision_resampler = training_args.tune_mm_vision_resampler = model_args.tune_mm_vision_resampler if model_args.tune_mm_mlp_adapter or model_args.tune_mm_vision_resampler: model.requires_grad_(False) if model_args.tune_mm_mlp_adapter: for p in model.get_model().mm_projector.parameters(): p.requires_grad = True model.config.freeze_mm_mlp_adapter = training_args.freeze_mm_mlp_adapter if training_args.freeze_mm_mlp_adapter: for p in model.get_model().mm_projector.parameters(): p.requires_grad = False model.config.freeze_mm_vision_resampler = training_args.freeze_mm_vision_resampler model.config.unfreeze_mm_vision_tower = model_args.unfreeze_mm_vision_tower if model_args.unfreeze_mm_vision_tower: vision_tower.requires_grad_(True) else: vision_tower.requires_grad_(False) else: rank0_print(f"Using mm_tunable_parts: {model_args.mm_tunable_parts}") model.config.mm_tunable_parts = training_args.mm_tunable_parts = model_args.mm_tunable_parts # Set the entire model to not require gradients by default model.requires_grad_(False) vision_tower.requires_grad_(False)
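# (Editor's note: in this mm_tunable_parts branch everything is frozen first;
# the parse loop that follows re-enables gradients only for the parts named in
# the comma-separated list, e.g. "mm_vision_tower,mm_mlp_adapter,mm_language_model".)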
model.get_model().mm_projector.requires_grad_(False) # Parse the mm_tunable_parts to decide which parts to unfreeze tunable_parts = model_args.mm_tunable_parts.split(",") if "mm_mlp_adapter" in tunable_parts: for p in model.get_model().mm_projector.parameters(): p.requires_grad = True if "mm_vision_tower" in tunable_parts: for name, param in model.named_parameters(): if "vision_tower" in name: param.requires_grad_(True) if "mm_language_model" in tunable_parts: for name, param in model.named_parameters(): if "vision_tower" not in name and "mm_projector" not in name and "vision_resampler" not in name: param.requires_grad_(True) total_params = sum(p.ds_numel if hasattr(p, "ds_numel") else p.numel() for p in model.parameters()) trainable_params = sum(p.ds_numel if hasattr(p, "ds_numel") else p.numel() for p in model.parameters() if p.requires_grad) rank0_print(f"Total parameters: ~{total_params/1e6:.2f}M") rank0_print(f"Trainable parameters: ~{trainable_params/1e6:.2f}M") if training_args.bits in [4, 8]: model.get_model().mm_projector.to(dtype=compute_dtype, device=training_args.device) model.config.mm_use_im_start_end = data_args.mm_use_im_start_end = model_args.mm_use_im_start_end model.config.mm_projector_lr = training_args.mm_projector_lr model.config.mm_vision_tower_lr = training_args.mm_vision_tower_lr training_args.use_im_start_end = model_args.mm_use_im_start_end model.config.mm_use_im_patch_token = model_args.mm_use_im_patch_token model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer) if training_args.bits in [4, 8]: from peft.tuners.lora import LoraLayer for name, module in model.named_modules(): if isinstance(module, LoraLayer): if training_args.bf16: module = module.to(torch.bfloat16) if "norm" in name: module = module.to(torch.float32) if "lm_head" in name or "embed_tokens" in name: if hasattr(module, "weight"): if training_args.bf16 and module.weight.dtype == torch.float32: module = module.to(torch.bfloat16) data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args) trainer = LLaVATrainer(model=model, tokenizer=tokenizer, args=training_args, **data_module) rank0_print(f"model config before training: {model.config}") if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")): trainer.train(resume_from_checkpoint=True) else: trainer.train() trainer.save_state() model.config.use_cache = True if training_args.lora_enable: state_dict = get_peft_state_maybe_zero_3(model.named_parameters(), training_args.lora_bias) non_lora_state_dict = get_peft_state_non_lora_maybe_zero_3(model.named_parameters()) if training_args.local_rank == 0 or training_args.local_rank == -1: if hasattr(model, "config"): model.config.save_pretrained(training_args.output_dir) if hasattr(model, "generation_config"): model.generation_config.save_pretrained(training_args.output_dir) model.save_pretrained(training_args.output_dir, state_dict=state_dict) torch.save(non_lora_state_dict, os.path.join(training_args.output_dir, "non_lora_trainables.bin")) else: safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir) rank0_print(f"Model saved to {training_args.output_dir}") if __name__ == "__main__": train() ================================================ FILE: llava-train_videochat/llava/train/train_mem.py ================================================ from llava.train.train import train from llava.dist_utils import init_distributed_mode if __name__ == "__main__": init_distributed_mode() train()
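Editor's aside: a minimal, self-contained sketch (not part of this repo) of the left-padding trick used by DataCollatorForSupervisedDataset.pad_sequence in train.py above. torch.nn.utils.rnn.pad_sequence only pads on the right, so left padding is emulated by flipping each sequence, right-padding, and flipping the padded batch back:

import torch

def pad_left(seqs, padding_value=0):
    # reverse each sequence, right-pad into a batch, then reverse the batch along time
    flipped = [torch.flip(s, [0]) for s in seqs]
    padded = torch.nn.utils.rnn.pad_sequence(flipped, batch_first=True, padding_value=padding_value)
    return torch.flip(padded, [1])

# pad_left([torch.tensor([1, 2, 3]), torch.tensor([4])]) -> [[1, 2, 3], [0, 0, 4]]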
================================================ FILE: llava-train_videochat/llava/utils.py ================================================ import datetime import logging import logging.handlers import os import sys import numpy as np import requests from llava.constants import LOGDIR server_error_msg = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**" moderation_msg = "I am sorry. Your input may violate our content moderation guidelines. Please avoid using harmful or offensive content." handler = None import torch.distributed as dist try: import av from decord import VideoReader, cpu except ImportError: print("Please install pyav to use video processing functions.") def process_video_with_decord(video_file, data_args): vr = VideoReader(video_file, ctx=cpu(0), num_threads=1) total_frame_num = len(vr) avg_fps = round(vr.get_avg_fps() / data_args.video_fps) frame_idx = [i for i in range(0, total_frame_num, avg_fps)] if data_args.frames_upbound > 0: if len(frame_idx) > data_args.frames_upbound: uniform_sampled_frames = np.linspace(0, total_frame_num - 1, data_args.frames_upbound, dtype=int) frame_idx = uniform_sampled_frames.tolist() video = vr.get_batch(frame_idx).asnumpy() # https://github.com/dmlc/decord/issues/208 vr.seek(0) return video def process_video_with_pyav(video_file, data_args): container = av.open(video_file) # !!! This is the only difference. Using auto threading container.streams.video[0].thread_type = "AUTO" video_frames = [] for packet in container.demux(): if packet.stream.type == 'video': for frame in packet.decode(): video_frames.append(frame) total_frame_num = len(video_frames) video_time = video_frames[-1].time avg_fps = round(total_frame_num / video_time / data_args.video_fps) frame_idx = [i for i in range(0, total_frame_num, avg_fps)] if data_args.frames_upbound > 0: if len(frame_idx) > data_args.frames_upbound: uniform_sampled_frames = np.linspace(0, total_frame_num - 1, data_args.frames_upbound, dtype=int) frame_idx = uniform_sampled_frames.tolist() frames = [video_frames[i] for i in frame_idx] return np.stack([x.to_ndarray(format="rgb24") for x in frames]) def rank0_print(*args): if dist.is_initialized(): if dist.get_rank() == 0: print(f"Rank {dist.get_rank()}: ", *args) else: print(*args) def rank_print(*args): if dist.is_initialized(): print(f"Rank {dist.get_rank()}: ", *args) else: print(*args) def build_logger(logger_name, logger_filename): global handler formatter = logging.Formatter( fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s", datefmt="%Y-%m-%d %H:%M:%S", ) # Set the format of root handlers if not logging.getLogger().handlers: logging.basicConfig(level=logging.INFO) logging.getLogger().handlers[0].setFormatter(formatter) # Redirect stdout and stderr to loggers stdout_logger = logging.getLogger("stdout") stdout_logger.setLevel(logging.INFO) sl = StreamToLogger(stdout_logger, logging.INFO) sys.stdout = sl stderr_logger = logging.getLogger("stderr") stderr_logger.setLevel(logging.ERROR) sl = StreamToLogger(stderr_logger, logging.ERROR) sys.stderr = sl # Get logger logger = logging.getLogger(logger_name) logger.setLevel(logging.INFO) # Add a file handler for all loggers if handler is None: os.makedirs(LOGDIR, exist_ok=True) filename = os.path.join(LOGDIR, logger_filename) handler = logging.handlers.TimedRotatingFileHandler(filename, when="D", utc=True) handler.setFormatter(formatter) for name, item in logging.root.manager.loggerDict.items(): if isinstance(item, logging.Logger): item.addHandler(handler) return 
logger class StreamToLogger(object): """ Fake file-like stream object that redirects writes to a logger instance. """ def __init__(self, logger, log_level=logging.INFO): self.terminal = sys.stdout self.logger = logger self.log_level = log_level self.linebuf = "" def __getattr__(self, attr): return getattr(self.terminal, attr) def write(self, buf): temp_linebuf = self.linebuf + buf self.linebuf = "" for line in temp_linebuf.splitlines(True): # From the io.TextIOWrapper docs: # On output, if newline is None, any '\n' characters written # are translated to the system default line separator. # By default sys.stdout.write() expects '\n' newlines and then # translates them so this is still cross platform. if line[-1] == "\n": self.logger.log(self.log_level, line.rstrip()) else: self.linebuf += line def flush(self): if self.linebuf != "": self.logger.log(self.log_level, self.linebuf.rstrip()) self.linebuf = "" def disable_torch_init(): """ Disable the redundant torch default initialization to accelerate model creation. """ import torch setattr(torch.nn.Linear, "reset_parameters", lambda self: None) setattr(torch.nn.LayerNorm, "reset_parameters", lambda self: None) def violates_moderation(text): """ Check whether the text violates the OpenAI moderation API. """ url = "https://api.openai.com/v1/moderations" headers = {"Content-Type": "application/json", "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]} text = text.replace("\n", "") data = "{" + '"input": ' + f'"{text}"' + "}" data = data.encode("utf-8") try: ret = requests.post(url, headers=headers, data=data, timeout=5) flagged = ret.json()["results"][0]["flagged"] except requests.exceptions.RequestException as e: print(f"######################### Moderation Error: {e} #########################") flagged = False except KeyError as e: print(f"######################### Moderation Error: {e} #########################") flagged = False return flagged def pretty_print_semaphore(semaphore): if semaphore is None: return "None" return f"Semaphore(value={semaphore._value}, locked={semaphore.locked()})" ================================================ FILE: llava-train_videochat/llava/video_utils.py ================================================ import random import os import io import av import cv2 import decord import imageio from decord import VideoReader import torch import numpy as np import math import gc import torchaudio from torchvision.transforms.functional import pil_to_tensor import re def get_index(num_frames, num_segments): seg_size = float(num_frames - 1) / num_segments start = int(seg_size / 2) offsets = np.array([ start + int(np.round(seg_size * idx)) for idx in range(num_segments) ]) return offsets def lazy_load_s3video(s3path_video, num_frames, video_start, video_end, client): # load video from ceph video_bytes_stream = client.get(s3path_video, enable_stream_lazyloding=True) container = av.open(video_bytes_stream) stream = container.streams.video[0] # duration = stream.duration real_fps = container.streams.video[0].average_rate time_base = container.streams.video[0].time_base start, end = video_start, video_end # Convert time to pts duration_frames = int(end - start) * real_fps frames_index = get_index(duration_frames, num_frames) pts_list = [] start_pts = int((start) / time_base) end_pts = int((end) / time_base) for frame_index in frames_index: pts_list.append(int((frame_index / real_fps)) / time_base) # Seek to nearest key frame from the start container.seek(max(start_pts, 0), stream=stream) frames = [] for frame in
container.decode(**{"video":0}): if frame.pts < start_pts: continue # if frame.pts <= end_pts: if len(pts_list) >0: if frame.pts >= pts_list[0]: frames.append(frame) pts_list.pop(0) else: break container.close() frames = [np.array(frames[idx].to_rgb().to_image()) for idx in range(len(frames))] final_frames = np.stack(frames) del frames del video_bytes_stream # T C H W gc.collect() return final_frames, frames_index, float(real_fps) def pts_to_secs(pts: int, time_base: float, start_pts: int) -> float: """ Converts a present time with the given time base and start_pts offset to seconds. Returns: time_in_seconds (float): The corresponding time in seconds. https://github.com/facebookresearch/pytorchvideo/blob/main/pytorchvideo/data/utils.py#L54-L64 """ if pts == math.inf: return math.inf return int(pts - start_pts) * time_base def get_pyav_video_duration(video_reader): video_stream = video_reader.streams.video[0] video_duration = pts_to_secs( video_stream.duration, video_stream.time_base, video_stream.start_time ) return float(video_duration) def get_frame_indices(num_frames, vlen, sample='middle', fix_start=None, input_fps=1, min_num_frames=1, max_num_frames=-1, local_num_frames=8): if min_num_frames > vlen: if sample == 'dynamic_fps1': min_num_frames = (vlen // local_num_frames) * local_num_frames else: min_num_frames = vlen if sample == 'dynamic_fps1': duration = float(vlen) / input_fps num_segments = int(duration // local_num_frames) if num_segments == 0: num_frames = local_num_frames else: num_frames = local_num_frames * num_segments if max_num_frames > 0: num_frames = min(num_frames, max_num_frames) sample = "middle" # NOTE # logger.info(f"? is OK (img), duation={duration} frames={num_frames}!!!!") num_frames = max(min_num_frames, num_frames) # print(f"\033[0;31m vlen={vlen}, input_fps={input_fps} num_frames={num_frames} \033[0m") if sample in ["rand", "middle"]: # uniform sampling acc_samples = min(num_frames, vlen) # split the video into `acc_samples` intervals, and sample from each interval. 
intervals = np.linspace(start=0, stop=vlen, num=acc_samples + 1).astype(int) ranges = [] for idx, interv in enumerate(intervals[:-1]): ranges.append((interv, intervals[idx + 1] - 1)) if sample == 'rand': try: frame_indices = [random.choice(range(x[0], x[1])) for x in ranges] except Exception: frame_indices = np.random.permutation(vlen)[:acc_samples] frame_indices.sort() frame_indices = list(frame_indices) elif fix_start is not None: frame_indices = [x[0] + fix_start for x in ranges] elif sample == 'middle': frame_indices = [(x[0] + x[1]) // 2 for x in ranges] else: raise NotImplementedError if len(frame_indices) < num_frames: # pad with the last frame padded_frame_indices = [frame_indices[-1]] * num_frames padded_frame_indices[:len(frame_indices)] = frame_indices frame_indices = padded_frame_indices elif "fps" in sample: # fps0.5, sequentially sample frames at 0.5 fps output_fps = float(sample[3:]) duration = float(vlen) / input_fps delta = 1 / output_fps # gap between frames, this is also the clip length each frame represents frame_seconds = np.arange(0 + delta / 2, duration + delta / 2, delta) frame_indices = np.around(frame_seconds * input_fps).astype(int) frame_indices = [e for e in frame_indices if e < vlen] if max_num_frames > 0 and len(frame_indices) > max_num_frames: frame_indices = frame_indices[:max_num_frames] # frame_indices = np.linspace(0 + delta / 2, duration + delta / 2, endpoint=False, num=max_num_frames) else: raise ValueError(f"Unsupported sample type: {sample}") return frame_indices def read_frames_av(video_path, num_frames, sample='rand', client=None, fix_start=None, min_num_frames=1, max_num_frames=-1, clip=None, local_num_frames=8): if clip is not None: raise NotImplementedError("The av reader does not support clip!") if 's3://' in video_path: video_bytes = client.get(video_path) byteio = io.BytesIO(video_bytes) byteio.seek(0) reader = av.open(byteio) else: byteio = None reader = av.open(video_path) frames = [f.to_rgb().to_ndarray() for f in reader.decode(video=0)] vlen = len(frames) duration = get_pyav_video_duration(reader) fps = vlen / float(duration) frame_indices = get_frame_indices( num_frames, vlen, sample=sample, fix_start=fix_start, input_fps=fps, min_num_frames=min_num_frames, max_num_frames=max_num_frames, local_num_frames=local_num_frames ) frames = np.stack([frames[idx] for idx in frame_indices]) # (T, H, W, C), np.uint8 # frames = frames.permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8 if byteio is not None: byteio.close() reader.close() return frames, frame_indices, float(fps), duration def read_frames_gif( video_path, num_frames, sample='rand', fix_start=None, min_num_frames=1, max_num_frames=-1, client=None, clip=None, local_num_frames=8 ): if clip is not None: raise NotImplementedError("The GIF reader does not support clip!") if 's3://' in video_path: video_bytes = client.get(video_path) byteio = io.BytesIO(video_bytes) gif = imageio.get_reader(byteio) else: byteio = None gif = imageio.get_reader(video_path) vlen = len(gif) fps = 1.
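# (Editor's note: fps is hardcoded to 1.0 for GIFs, so `duration` below equals
# the frame count and the sampled frame indices double as timestamps in seconds.)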
duration = vlen / fps frame_indices = get_frame_indices( num_frames, vlen, sample=sample, fix_start=fix_start, min_num_frames=min_num_frames, max_num_frames=max_num_frames, local_num_frames=local_num_frames, input_fps=fps # NOTE hardcoded for now ) frames = [] min_h = min_w = 100000 hw_set = set() for index, frame in enumerate(gif): # for index in frame_idxs: if index in frame_indices: frame = cv2.cvtColor(frame, cv2.COLOR_RGBA2RGB) frame = frame.astype(np.uint8) # # (H x W x C) to (C x H x W) # frame = frame.permute(2, 0, 1) frames.append(frame) hw_set.add(frame.shape) if frame.shape[0] < min_h: min_h = frame.shape[0] if frame.shape[1] < min_w: min_w = frame.shape[1] # print(hw_set, min_h, min_w) if len(hw_set) > 1: frames = [i[:min_h, :min_w] for i in frames] frames = np.stack(frames) # .float() / 255 if byteio is not None: byteio.close() return frames, frame_indices, float(fps), duration # for tgif def read_frames_decord( video_path, num_frames, sample='rand', fix_start=None, min_num_frames=1, max_num_frames=-1, client=None, clip=None, local_num_frames=8 ): if video_path.endswith('.avi'): return read_frames_av(video_path=video_path, num_frames=num_frames, sample=sample, fix_start=fix_start, min_num_frames=min_num_frames, max_num_frames=max_num_frames, client=client, clip=clip, local_num_frames=local_num_frames) if 's3://' in video_path: video_bytes = client.get(video_path) if video_bytes is None or len(video_bytes) == 0: raise ValueError(f"Can't read bytes from {video_path}!") byteio = io.BytesIO(video_bytes) video_reader = VideoReader(byteio, num_threads=1) else: byteio = None video_reader = VideoReader(video_path, num_threads=1) vlen = len(video_reader) fps = video_reader.get_avg_fps() duration = vlen / float(fps) if clip: start, end = clip start = max(0, start) end = min(duration - 0.1, end) # prevent end from running past the end of the video duration = end - start vlen = int(duration * fps) start_index = int(start * fps) frame_indices = get_frame_indices( num_frames, vlen, sample=sample, fix_start=fix_start, input_fps=fps, min_num_frames=min_num_frames, max_num_frames=max_num_frames, local_num_frames=local_num_frames ) if clip: frame_indices = [f + start_index for f in frame_indices] # print(fps, frame_indices) frames = video_reader.get_batch(frame_indices).asnumpy() # (T, H, W, C), np.uint8 # https://github.com/dmlc/decord/issues/208 video_reader.seek(0) if byteio is not None: byteio.close() # frames = frames.permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8 return frames, frame_indices, float(fps), duration def read_frames_img( video_path, num_frames, sample='rand', fix_start=None, min_num_frames=1, max_num_frames=-1, client=None, clip=None, local_num_frames=8 ): def extract_frame_number(filename): # Extract the numeric part from the filename using regular expressions if filename.endswith('.jpg'): match = re.search(r'_(\d+).jpg$', filename) elif filename.endswith('.jpeg'): match = re.search(r'_(\d+).jpeg$', filename) elif filename.endswith('.png'): match = re.search(r'_(\d+).png$', filename) else: raise NotImplementedError(f"Wrong filename: {filename}") return int(match.group(1)) if match else -1 def sort_frames(frame_paths): # Extract filenames from each path and sort by their numeric part return sorted(frame_paths, key=lambda x: extract_frame_number(os.path.basename(x))) # img_list=[] if "s3://" in video_path: img_list = sort_frames(client.list(video_path)) else: img_list = sort_frames(list(os.listdir(video_path))) if 'tvqa' in video_path.lower(): fps = 3.0 # TVQA frames are extracted at 3 fps else: fps = 1.0 # NOTE unknown data is simply treated as 1 fps if clip is not None: start =
float(clip[0]) end = float(clip[1]) start = max(0, start) end = min(len(img_list) / fps, end) # prevent end from running past the end of the video vlen = (end - start) * fps else: vlen = len(img_list) duration = vlen / fps if min_num_frames > vlen: if sample == 'dynamic_fps1': min_num_frames = (vlen // local_num_frames) * local_num_frames else: min_num_frames = vlen if sample == 'dynamic_fps1': num_segments = int(duration // local_num_frames) if num_segments == 0: num_frames = local_num_frames else: num_frames = local_num_frames * num_segments if max_num_frames > 0: num_frames = min(num_frames, max_num_frames) num_frames = max(min_num_frames, num_frames) num_frames = int(num_frames) if clip is not None: def _get_index_by_time(start_sec, end_sec, num_segments=8, fps=1., max_frame=9999): start_idx = max(1, round(start_sec * fps)) end_idx = min(round(end_sec * fps), max_frame) seg_size = float(end_idx - start_idx) / (num_segments - 1) offsets = np.array([start_idx + int(np.round(seg_size * idx)) for idx in range(num_segments)]) return offsets frame_indices = _get_index_by_time(float(clip[0]), float(clip[1]), num_segments=num_frames, fps=fps, max_frame=len(img_list)-1) else: frame_indices = get_frame_indices( num_frames, vlen, sample=sample, fix_start=fix_start, min_num_frames=min_num_frames, max_num_frames=max_num_frames, local_num_frames=local_num_frames ) imgs = [] for idx in frame_indices: frame_fname = os.path.join(video_path, img_list[idx]) if "s3://" in video_path: img_bytes = client.get(frame_fname) else: with open(frame_fname, 'rb') as f: img_bytes = f.read() img_np = np.frombuffer(img_bytes, np.uint8) img = cv2.imdecode(img_np, cv2.IMREAD_COLOR) cv2.cvtColor(img, cv2.COLOR_BGR2RGB, img) imgs.append(img) # print(f"\033[0;31m img_list={len(img_list)} video_path={video_path}, len(imgs)={len(imgs)}, frame_indices={frame_indices} num_frames={num_frames} \033[0m") frames = np.array(imgs, dtype=np.uint8) # frames = torch.tensor(np.array(imgs), dtype=torch.uint8).permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8 # logger.info(f"{video_path} is OK (img), duration={vlen}!!!!") return frames, frame_indices, fps, duration # NOTE image folders are simply treated as 1 fps def read_frames_fake( video_path, num_frames, sample='rand', fix_start=None, max_num_frames=-1, client=None, clip=None, local_num_frames=8 ): print("I am fake!!!!!!") frame_indices = get_frame_indices( num_frames, 100, sample=sample, fix_start=fix_start, input_fps=1, max_num_frames=max_num_frames, local_num_frames=local_num_frames ) frames = np.random.randint(0, 255, size=(len(frame_indices), 224, 224, 3)) # (T, H, W, C), np.uint8 return frames, frame_indices, 1.0, 100 VIDEO_READER_FUNCS = { 'av': read_frames_av, 'decord': read_frames_decord, 'gif': read_frames_gif, 'img': read_frames_img, 'frame': read_frames_img, 'lazy': lazy_load_s3video, 'fake': read_frames_fake } ================================================ FILE: llava-train_videochat/pyproject.toml ================================================ [tool.black] line-length = 240 [build-system] requires = ["setuptools>=61.0"] build-backend = "setuptools.build_meta" [project] name = "llava" version = "1.7.0.dev0" description = "LLaVA OneVision: The Next Generation of LLaVA with Better Image and Video Understanding Capabilities" readme = "README.md" requires-python = ">=3.8" classifiers = [ "Programming Language :: Python :: 3", "License :: OSI Approved :: Apache Software License", ] [project.optional-dependencies] standalone = [ "shortuuid", "httpx==0.24.0", "einops", "ftfy", ] train = [ "llava[standalone]", "numpy==1.26.1", "open_clip_torch", "fastapi",
"gradio==3.35.2", "markdown2[all]", "numpy", "requests", "sentencepiece", "torch==2.1.2", "torchvision==0.16.2", "uvicorn", "wandb", "deepspeed==0.14.2", "peft==0.4.0", "accelerate>=0.29.1", "tokenizers~=0.15.2", "transformers@git+https://github.com/huggingface/transformers.git@1c39974a4c4036fd641bc1191cc32799f85715a4", "bitsandbytes==0.41.0", "scikit-learn==1.2.2", "sentencepiece~=0.1.99", "einops==0.6.1", "einops-exts==0.0.4", "gradio_client==0.2.9", "urllib3<=2.0.0", "datasets==2.16.1", "pydantic==1.10.8", "timm", "hf_transfer", "opencv-python", "av", "decord", "tyro", "scipy", ] dev0 = [ "llava[standalone]", "open_clip_torch", "fastapi", "markdown2[all]", "uvicorn", "bitsandbytes==0.41.0", "scikit-learn==1.2.2", "datasets==2.16.1", "pydantic==1.10.8", "timm", "hf_transfer", "opencv-python", "av", "decord", "tyro", "scipy", ] [project.urls] "Homepage" = "https://llava-vl.github.io" "Bug Tracker" = "https://github.com/haotian-liu/LLaVA/issues" [tool.setuptools.packages.find] include = ["llava*", "trl*"] exclude = [ "assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*", "checkpoints*", "project_checkpoints*", "debug_checkpoints*", "mlx_configs*", "wandb*", "notebooks*", ] [tool.wheel] exclude = [ "assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*", "checkpoints*", "project_checkpoints*", "debug_checkpoints*", "mlx_configs*", "wandb*", "notebooks*", ] ================================================ FILE: llava-train_videochat/requirements.txt ================================================ Babel==2.14.0 DataProperty==1.0.1 Deprecated==1.2.14 GitPython==3.1.43 Jinja2==3.1.3 Levenshtein==0.25.1 MarkupSafe==2.1.5 PyJWT==2.8.0 PyYAML==6.0.1 Pygments==2.17.2 QtPy==2.4.1 Send2Trash==1.8.3 absl-py==2.1.0 accelerate==0.33.0 addict==2.4.0 aiofiles==23.2.1 aiohttp==3.9.1 aiolimiter==1.2.1 aiosignal==1.3.1 alembic==1.13.0 altair==5.4.1 anls==0.0.2 annotated-types==0.7.0 anthropic==0.45.2 anyio==4.4.0 appdirs==1.4.4 async-timeout==4.0.3 attrs==23.1.0 audioread==3.0.1 av==14.0.1 bitsandbytes==0.41.0 black==24.1.0 blinker==1.7.0 blis==0.7.11 boto3==1.28.25 botocore==1.31.25 Brotli==1.1.0 capture-metric==0.1.13 catalogue==2.0.10 certifi==2023.11.17 cffi==1.16.0 cfgv==3.4.0 chardet==5.2.0 charset-normalizer==3.3.2 click==8.1.7 cloudpathlib==0.19.0 cloudpickle==3.0.0 cmake==3.25.0 colorama==0.4.6 coloredlogs==15.0.1 confection==0.1.5 contourpy==1.2.0 crcmod==1.7 cryptography==43.0.0 ctranslate2==4.4.0 cycler==0.12.1 cymem==2.0.8 databricks-cli==0.18.0 DataProperty==1.0.1 datasets==2.16.1 decorator==4.4.2 decord==0.6.0 deepspeed==0.14.2 dill==0.3.9 distlib==0.3.8 distro==1.9.0 docker==6.1.3 docker-pycreds==0.4.0 docstring_parser==0.16 easydict==1.13 einops==0.6.1 einops-exts==0.0.4 entrypoints==0.4 environs==9.5.0 et-xmlfile==1.1.0 evaluate==0.4.2 exceptiongroup==1.2.2 FactualSceneGraph==0.5.0 fastapi==0.115.6 faster-whisper==1.1.0 ffmpy==0.4.0 filelock==3.14.0 -e git+https://github.com/Dao-AILab/flash-attention.git@9a11f440d3a34f618b4ba814c825b109c6d7e8f5#egg=flash_attn Flask==3.0.0 flatbuffers==24.12.23 fonttools==4.46.0 frozenlist==1.4.0 fsspec==2023.10.0 ftfy==6.1.1 func_timeout==4.3.5 gitdb==4.0.11 GitPython==3.1.40 gradio==5.9.1 gradio_client==1.5.2 greenlet==3.0.2 grpcio==1.66.1 gunicorn==21.2.0 h11==0.14.0 hf_transfer==0.1.8 hjson==3.1.0 httpcore==1.0.7 httpx==0.27.2 httpx-sse==0.4.0 huggingface-hub==0.27.0 humanfriendly==10.0 humanize==4.7.0 identify==2.6.0 idna==3.9 imageio==2.31.1 imageio-ffmpeg==0.5.1 importlib-metadata==7.0.0 
itsdangerous==2.1.2 Jinja2==3.1.2 jiter==0.5.0 jmespath==0.10.0 joblib==1.3.2 jsonlines==4.0.0 jsonschema==4.23.0 jsonschema-specifications==2023.12.1 kiwisolver==1.4.5 langcodes==3.4.0 language_data==1.2.0 latex2mathml==3.77.0 lazy_loader==0.4 Levenshtein==0.25.1 librosa==0.10.2.post1 liger-kernel==0.0.0 linkify-it-py==2.0.3 lit==15.0.7 llvmlite==0.43.0 loguru==0.7.2 lxml==5.3.0 Mako==1.3.0 marisa-trie==1.2.0 Markdown==3.5.1 markdown-it-py==2.2.0 markdown2==2.5.0 MarkupSafe==2.1.3 marshmallow==3.20.1 matplotlib==3.8.2 mbstrdecoder==1.1.3 mdit-py-plugins==0.3.3 mdurl==0.1.2 mlflow==2.9.1 model-index==0.1.11 moviepy==1.0.3 mpmath==1.3.0 msgpack==1.0.8 multidict==6.0.4 multiprocess==0.70.17 multiprocessing-logging==0.3.4 murmurhash==1.0.10 mutagen==1.47.0 mypy-extensions==1.0.0 narwhals==1.5.5 networkx==3.2.1 ninja==1.11.1.1 nltk==3.9.1 nodeenv==1.9.1 numba==0.60.0 numexpr==2.10.1 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.18.1 nvidia-nvjitlink-cu12==12.6.20 nvidia-nvtx-cu12==12.1.105 oauthlib==3.2.2 onnxruntime==1.16.3 open_clip_torch==2.26.1 openai==1.44.0 opencv-python==4.8.1.78 opencv-python-headless==4.10.0.84 opendatalab==0.0.10 openmim==0.3.9 openpyxl==3.1.5 openxlab==0.1.1 ordered-set==4.1.0 orjson==3.10.7 oss2==2.17.0 packaging==24.1 pandas==2.1.3 pathos==0.3.3 pathspec==0.12.1 pathtools==0.1.2 pathvalidate==3.2.1 peft==0.4.0 Pillow==10.1.0 platformdirs==4.1.0 pooch==1.8.2 portalocker==2.10.1 pox==0.3.5 ppft==1.7.6.9 pre-commit==3.8.0 preshed==3.0.9 prettytable==3.9.0 proglog==0.1.10 protobuf==3.20.0 psutil==5.9.5 py-cpuinfo==9.0.0 pyarrow==14.0.1 pyarrow-hotfix==0.6 pybind11==2.13.5 pycocoevalcap==1.2 pycocotools==2.0.8 pycparser==2.22 pycryptodome==3.20.0 pycryptodomex==3.20.0 pydantic==2.10.4 pydantic_core==2.27.2 pydub==0.25.1 Pygments==2.18.0 PyJWT==2.8.0 pynvml==11.5.3 pyparsing==3.1.1 pyre-extensions==0.0.29 pytablewriter==1.2.0 python-dateutil==2.8.2 python-dotenv==1.0.0 python-hostlist==1.23.0 python-multipart==0.0.20 pytz==2023.3.post1 pywsd==1.2.5 PyYAML==6.0.1 querystring-parser==1.2.4 rapidfuzz==3.9.7 redis==5.0.7 referencing==0.35.1 regex==2023.10.3 reka-api==3.0.8 requests==2.28.2 rfc3986==1.5.0 rich==13.4.2 rouge==1.0.1 rpds-py==0.20.0 ruff==0.8.5 s3transfer==0.6.1 sacrebleu==2.4.3 safehttpx==0.1.6 safetensors==0.4.1 scikit-learn==1.2.2 scipy==1.11.4 seaborn==0.13.2 semantic-version==2.10.0 sentence-transformers==3.0.1 sentencepiece==0.1.99 sentry-sdk==1.29.2 setproctitle==1.3.2 shellingham==1.5.4 shortuuid==1.0.13 shtab==1.7.1 six==1.16.0 smart-open==7.0.4 smmap==5.0.1 sniffio==1.3.1 soundfile==0.12.1 soxr==0.3.7 spaces==0.31.1 spacy==3.7.6 spacy-legacy==3.0.12 spacy-loggers==1.0.5 SQLAlchemy==2.0.23 sqlitedict==2.1.0 sqlparse==0.4.4 srsly==2.4.8 starlette==0.41.3 svgwrite==1.4.3 sympy==1.12 tabledata==1.3.3 tabulate==0.9.0 tcolorpy==0.1.6 tenacity==8.3.0 tensorboard==2.17.1 tensorboard-data-server==0.7.2 tensorboardX==2.6 termcolor==2.3.0 thinc==8.2.5 threadpoolctl==3.2.0 tiktoken==0.7.0 timm==0.4.12 tokenizers==0.19.1 tomli==2.0.1 tomlkit==0.13.2 torch==2.1.2 torchaudio==2.1.2 torchvision==0.16.2 tqdm==4.65.2 tqdm-multiprocess==0.0.11 transformers==4.40.1 transformers-stream-generator==0.0.5 triton==2.1.0 typepy==1.3.2 typer==0.12.5 typing-inspect==0.9.0 typing_extensions==4.12.2 tyro==0.8.10 
tzdata==2023.3 uc-micro-py==1.0.3 urllib3==1.26.20 uvicorn==0.30.6 virtualenv==20.26.4 wandb==0.17.9 wasabi==1.1.3 wavedrom==2.0.3.post3 wcwidth==0.2.12 weasel==0.4.1 websocket-client==1.7.0 websockets==13.0 Werkzeug==3.0.1 wn==0.0.23 wrapt==1.16.0 xformers==0.0.20 xxhash==3.4.1 yapf==0.40.2 yarl==1.9.4 yt-dlp==2024.8.6 zipp==3.17.0 zss==1.2.0 zstandard==0.23.0 ================================================ FILE: llava-train_videochat/scripts/train/stage1-init_connector/stage1_internvideo2_tome16_res224_qwen7b.sh ================================================ export OMP_NUM_THREADS=1 export DISABLE_ADDMM_CUDA_LT=1 export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1 DATA_VERSION="data/stage1_init_connector_iv1m.yaml" DATA_VERSION_CLEAN=$(basename "$DATA_VERSION") VISION_MODEL_VERSION="internvideo2" VISION_MODEL_VERSION_CLEAN="internvideo2" LLM_VERSION="Qwen/Qwen2_5-7B-Instruct" LLM_VERSION_CLEAN="Qwen2_5_7B" mm_projector_type=tome16_mlp_hd64 PROMPT_VERSION=plain BASE_RUN_NAME=stage1-${VISION_MODEL_VERSION}-${mm_projector_type}-${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_${PROMPT_VERSION}_$(date +"%Y%m%d_%H%M%S") echo "BASE_RUN_NAME: ${BASE_RUN_NAME}" PARTITION='video' JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") NUM_GPU=8 # NOTE: If you don't use slurm, please ref to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for training command. srun -p ${PARTITION} \ --job-name=${JOB_NAME} \ --ntasks=${NUM_GPU} \ --gres=gpu:8 \ --ntasks-per-node=8 \ --cpus-per-task=16 \ --kill-on-bad-exit=1 \ python -u llava/train/train_mem.py \ --deepspeed scripts/zero1.json \ --model_name_or_path ${LLM_VERSION} \ --version ${PROMPT_VERSION} \ --data_path ${DATA_VERSION} \ --vision_tower ${VISION_MODEL_VERSION} \ --mm_tunable_parts="mm_mlp_adapter" \ --mm_vision_select_layer -2 \ --mm_projector_type ${mm_projector_type} \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --bf16 True \ --output_dir ./checkpoints/stage1-init_connector/${BASE_RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size 16 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 4 \ --evaluation_strategy "no" \ --save_strategy "no" \ --save_steps 50000 \ --learning_rate 1e-3 \ --weight_decay 0. 
\ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 8192 \ --gradient_checkpointing True \ --dataloader_num_workers 16 \ --lazy_preprocess True \ --report_to tensorboard \ --run_name $BASE_RUN_NAME \ --attn_implementation sdpa \ --frames_upbound 4 \ --time_msg short \ --local_num_frames 4 \ --sample_type middle \ --vision_encode_type video_image \ --mm_pos_num_frames 4 \ --mm_local_num_frames 4 \ --verbose_logging True >> ./output_logs/stage1-init_connector/${BASE_RUN_NAME}.log # You can delete the sdpa attn_implementation if you want to use flash attn ================================================ FILE: llava-train_videochat/scripts/train/stage1-init_connector/stage1_umt_tome16_res224_qwen7b.sh ================================================ export OMP_NUM_THREADS=1 export DISABLE_ADDMM_CUDA_LT=1 export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1 DATA_VERSION="data/stage1_init_connector_iv1m.yaml" DATA_VERSION_CLEAN=$(basename "$DATA_VERSION") VISION_MODEL_VERSION="umt-large" VISION_MODEL_VERSION_CLEAN="umt-large" LLM_VERSION="Qwen/Qwen2-7B-Instruct" LLM_VERSION_CLEAN="Qwen2_7B" mm_projector_type=tome16_mlp_hd64 PROMPT_VERSION=plain BASE_RUN_NAME=stage1-${VISION_MODEL_VERSION}-${mm_projector_type}-${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_${PROMPT_VERSION}_$(date +"%Y%m%d_%H%M%S") echo "BASE_RUN_NAME: ${BASE_RUN_NAME}" PARTITION='video' JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") NUM_GPU=8 # NOTE: If you don't use slurm, please ref to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for training command. srun -p ${PARTITION} \ --job-name=${JOB_NAME} \ --ntasks=${NUM_GPU} \ --gres=gpu:8 \ --ntasks-per-node=8 \ --cpus-per-task=16 \ --kill-on-bad-exit=1 \ python -u llava/train/train_mem.py \ --deepspeed scripts/zero1.json \ --model_name_or_path ${LLM_VERSION} \ --version ${PROMPT_VERSION} \ --data_path ${DATA_VERSION} \ --vision_tower ${VISION_MODEL_VERSION} \ --mm_tunable_parts="mm_mlp_adapter" \ --mm_vision_select_layer -2 \ --mm_projector_type ${mm_projector_type} \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --bf16 True \ --output_dir ./checkpoints/stage1-init_connector/${BASE_RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size 16 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 4 \ --evaluation_strategy "no" \ --save_strategy "no" \ --save_steps 50000 \ --learning_rate 1e-3 \ --weight_decay 0. 
\ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 8192 \ --gradient_checkpointing True \ --dataloader_num_workers 16 \ --lazy_preprocess True \ --report_to tensorboard \ --run_name $BASE_RUN_NAME \ --attn_implementation sdpa \ --frames_upbound 4 \ --time_msg short \ --local_num_frames 4 \ --sample_type middle \ --vision_encode_type video_image \ --mm_pos_num_frames 4 \ --mm_local_num_frames 4 \ --verbose_logging True >> ./output_logs/stage1-init_connector/${BASE_RUN_NAME}.log # You can delete the sdpa attn_implementation if you want to use flash attn ================================================ FILE: llava-train_videochat/scripts/train/stage1-init_connector/stage1_umt_tome16_res448_qwen1_5b.sh ================================================ export OMP_NUM_THREADS=1 export DISABLE_ADDMM_CUDA_LT=1 export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1 DATA_VERSION="data/stage1_init_connector_iv1m.yaml" DATA_VERSION_CLEAN=$(basename "$DATA_VERSION") VISION_MODEL_VERSION="umt-hd-large" VISION_MODEL_VERSION_CLEAN="umt-hd-large" LLM_VERSION="Qwen/Qwen2_5-1.5B-Instruct" LLM_VERSION_CLEAN="Qwen2_5_1_5B" mm_projector_type=tome16_mlp_hd64 PROMPT_VERSION=plain BASE_RUN_NAME=stage1-${VISION_MODEL_VERSION}-${mm_projector_type}-${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_${PROMPT_VERSION}_$(date +"%Y%m%d_%H%M%S") echo "BASE_RUN_NAME: ${BASE_RUN_NAME}" PARTITION='video' JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") NUM_GPU=8 # NOTE: If you don't use slurm, please ref to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for training command. srun -p ${PARTITION} \ --job-name=${JOB_NAME} \ --ntasks=${NUM_GPU} \ --gres=gpu:8 \ --ntasks-per-node=8 \ --cpus-per-task=16 \ --kill-on-bad-exit=1 \ python -u llava/train/train_mem.py \ --deepspeed scripts/zero1.json \ --model_name_or_path ${LLM_VERSION} \ --version ${PROMPT_VERSION} \ --data_path ${DATA_VERSION} \ --vision_tower ${VISION_MODEL_VERSION} \ --mm_tunable_parts="mm_mlp_adapter" \ --mm_vision_select_layer -2 \ --mm_projector_type ${mm_projector_type} \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --bf16 True \ --output_dir ./checkpoints/stage1-init_connector/${BASE_RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size 16 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 4 \ --evaluation_strategy "no" \ --save_strategy "no" \ --save_steps 50000 \ --learning_rate 1e-3 \ --weight_decay 0. 
\ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 8192 \ --gradient_checkpointing True \ --dataloader_num_workers 16 \ --lazy_preprocess True \ --report_to tensorboard \ --run_name $BASE_RUN_NAME \ --attn_implementation sdpa \ --frames_upbound 4 \ --time_msg short \ --local_num_frames 4 \ --sample_type middle \ --vision_encode_type video_image \ --mm_pos_num_frames 4 \ --mm_local_num_frames 4 \ --verbose_logging True >> ./output_logs/stage1-init_connector/${BASE_RUN_NAME}.log # You can delete the sdpa attn_implementation if you want to use flash attn ================================================ FILE: llava-train_videochat/scripts/train/stage2-visual_pretraining/stage2_internvideo2_tome16_res224_qwen_7b.sh ================================================ export OMP_NUM_THREADS=1 export DISABLE_ADDMM_CUDA_LT=1 export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1 DATA_VERSION="data/stage2_short_pretrain_iv6m.yaml" DATA_VERSION_CLEAN=$(basename "$DATA_VERSION") VISION_MODEL_VERSION="internvideo2" VISION_MODEL_VERSION_CLEAN="internvideo2" LLM_VERSION="Qwen/Qwen2_5-7B-Instruct" LLM_VERSION_CLEAN="Qwen2_5_7B" mm_projector_type=tome16_mlp_hd64 PROMPT_VERSION="qwen_2" MID_RUN_NAME=stage2-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S") echo "MID_RUN_NAME: ${MID_RUN_NAME}" PARTITION='video' JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") NUM_GPU=32 # NOTE: If you don't use slurm, please ref to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for training command. srun -p ${PARTITION} \ --job-name=${JOB_NAME} \ --ntasks=${NUM_GPU} \ --gres=gpu:8 \ --ntasks-per-node=8 \ --cpus-per-task=16 \ --kill-on-bad-exit=1 \ python -u llava/train/train_mem.py \ --deepspeed scripts/zero1.json \ --model_name_or_path ${LLM_VERSION} \ --version ${PROMPT_VERSION} \ --data_path ${DATA_VERSION} \ --vision_tower ${VISION_MODEL_VERSION} \ --pretrain_mm_mlp_adapter="Your_stage_checkpoint_path/mm_projector.bin" \ --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \ --mm_vision_tower_lr=2e-6 \ --mm_vision_select_layer -2 \ --mm_projector_type ${mm_projector_type} \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --group_by_modality_length True \ --image_aspect_ratio anyres_nopad \ --image_grid_pinpoints "(1x1),...,(6x6)" \ --mm_patch_merge_type spatial_nopad \ --mm_newline_position nothing \ --bf16 True \ --run_name $MID_RUN_NAME \ --output_dir ./checkpoints/stage2-visual_pretraining/${MID_RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 12000 \ --save_total_limit 1 \ --learning_rate 1e-5 \ --weight_decay 0. 
\ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 32768 \ --gradient_checkpointing True \ --dataloader_num_workers 8 \ --lazy_preprocess True \ --report_to tensorboard \ --torch_compile True \ --torch_compile_backend "inductor" \ --dataloader_drop_last True \ --attn_implementation sdpa \ --frames_upbound 8 \ --time_msg short \ --local_num_frames 4 \ --vision_encode_type video_image \ --sample_type dynamic_fps1 \ --mm_close_init True \ --mm_local_num_frames 4 \ --verbose_logging True >> ./output_logs/stage2-visual_pretraining/${MID_RUN_NAME}.log # You can delete the sdpa attn_implementation if you want to use flash attn ================================================ FILE: llava-train_videochat/scripts/train/stage2-visual_pretraining/stage2_umt_tome16_res224_qwen_7b.sh ================================================ export OMP_NUM_THREADS=1 export DISABLE_ADDMM_CUDA_LT=1 export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1 DATA_VERSION="data/stage2_short_pretrain_iv6m.yaml" DATA_VERSION_CLEAN=$(basename "$DATA_VERSION") VISION_MODEL_VERSION="umt-large" VISION_MODEL_VERSION_CLEAN="umt-large" LLM_VERSION="Qwen/Qwen2-7B-Instruct" LLM_VERSION_CLEAN="Qwen2_7B" mm_projector_type=tome16_mlp_hd64 PROMPT_VERSION="qwen_2" MID_RUN_NAME=stage2-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S") echo "MID_RUN_NAME: ${MID_RUN_NAME}" PARTITION='video' JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") NUM_GPU=32 # NOTE: If you don't use slurm, please ref to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for training command. srun -p ${PARTITION} \ --job-name=${JOB_NAME} \ --ntasks=${NUM_GPU} \ --gres=gpu:8 \ --ntasks-per-node=8 \ --cpus-per-task=16 \ --kill-on-bad-exit=1 \ python -u llava/train/train_mem.py \ --deepspeed scripts/zero1.json \ --model_name_or_path ${LLM_VERSION} \ --version ${PROMPT_VERSION} \ --data_path ${DATA_VERSION} \ --vision_tower ${VISION_MODEL_VERSION} \ --pretrain_mm_mlp_adapter="Your_stage_checkpoint_path/mm_projector.bin" \ --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \ --mm_vision_tower_lr=2e-6 \ --mm_vision_select_layer -2 \ --mm_projector_type ${mm_projector_type} \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --group_by_modality_length True \ --image_aspect_ratio anyres_nopad \ --image_grid_pinpoints "(1x1),...,(6x6)" \ --mm_patch_merge_type spatial_nopad \ --mm_newline_position nothing \ --bf16 True \ --run_name $MID_RUN_NAME \ --output_dir ./checkpoints/stage2-visual_pretraining/${MID_RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 12000 \ --save_total_limit 1 \ --learning_rate 1e-5 \ --weight_decay 0. 
\ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 32768 \ --gradient_checkpointing True \ --dataloader_num_workers 8 \ --lazy_preprocess True \ --report_to tensorboard \ --torch_compile True \ --torch_compile_backend "inductor" \ --dataloader_drop_last True \ --attn_implementation sdpa \ --frames_upbound 8 \ --time_msg short \ --local_num_frames 4 \ --vision_encode_type video_image \ --sample_type dynamic_fps1 \ --mm_close_init True \ --mm_local_num_frames 4 \ --verbose_logging True >> ./output_logs/stage2-visual_pretraining/${MID_RUN_NAME}.log # You can delete the sdpa attn_implementation if you want to use flash attn ================================================ FILE: llava-train_videochat/scripts/train/stage2-visual_pretraining/stage2_umt_tome16_res448_qwen_1_5b.sh ================================================ export OMP_NUM_THREADS=1 export DISABLE_ADDMM_CUDA_LT=1 export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1 DATA_VERSION="data/stage2_short_pretrain_iv6m.yaml" DATA_VERSION_CLEAN=$(basename "$DATA_VERSION") VISION_MODEL_VERSION="umt-hd-large" VISION_MODEL_VERSION_CLEAN="umt-hd-large" LLM_VERSION="Qwen/Qwen2_5-1.5B-Instruct" LLM_VERSION_CLEAN="Qwen2_5_1_5B" mm_projector_type=tome16_mlp_hd64 PROMPT_VERSION="qwen_2" MID_RUN_NAME=stage2-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S") echo "MID_RUN_NAME: ${MID_RUN_NAME}" PARTITION='video' JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") NUM_GPU=32 # NOTE: If you don't use slurm, please ref to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for training command. srun -p ${PARTITION} \ --job-name=${JOB_NAME} \ --ntasks=${NUM_GPU} \ --gres=gpu:8 \ --ntasks-per-node=8 \ --cpus-per-task=16 \ --kill-on-bad-exit=1 \ python -u llava/train/train_mem.py \ --deepspeed scripts/zero1.json \ --model_name_or_path ${LLM_VERSION} \ --version ${PROMPT_VERSION} \ --data_path ${DATA_VERSION} \ --vision_tower ${VISION_MODEL_VERSION} \ --pretrain_mm_mlp_adapter="Your_stage_checkpoint_path/mm_projector.bin" \ --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \ --mm_vision_tower_lr=2e-6 \ --mm_vision_select_layer -2 \ --mm_projector_type ${mm_projector_type} \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --group_by_modality_length True \ --image_aspect_ratio anyres_nopad \ --image_grid_pinpoints "(1x1),...,(6x6)" \ --mm_patch_merge_type spatial_nopad \ --mm_newline_position nothing \ --bf16 True \ --run_name $MID_RUN_NAME \ --output_dir ./checkpoints/stage2-visual_pretraining/${MID_RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 12000 \ --save_total_limit 1 \ --learning_rate 1e-5 \ --weight_decay 0. 
\ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 32768 \ --gradient_checkpointing True \ --dataloader_num_workers 8 \ --lazy_preprocess True \ --report_to tensorboard \ --torch_compile True \ --torch_compile_backend "inductor" \ --dataloader_drop_last True \ --attn_implementation sdpa \ --frames_upbound 8 \ --time_msg short \ --local_num_frames 4 \ --vision_encode_type video_image \ --sample_type dynamic_fps1 \ --mm_close_init True \ --mm_local_num_frames 4 \ --verbose_logging True >> ./output_logs/stage2-visual_pretraining/${MID_RUN_NAME}.log # You can delete the sdpa attn_implementation if you want to use flash attn ================================================ FILE: llava-train_videochat/scripts/train/stage3-video_sft/stage3_internvideo2_tome16_res224_qwen_7b.sh ================================================ export OMP_NUM_THREADS=1 export DISABLE_ADDMM_CUDA_LT=1 export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1 DATA_VERSION="data/stage3_short-long_mix_sft.yaml" DATA_VERSION_CLEAN=$(basename "$DATA_VERSION") VISION_MODEL_VERSION="internvideo2" VISION_MODEL_VERSION_CLEAN="internvideo2" LLM_VERSION="Your_stage2_checkpoint_path" LLM_VERSION_CLEAN="Qwen2_5_7B" mm_projector_type=tome16_mlp_hd64 PROMPT_VERSION="qwen_2" MID_RUN_NAME=stage3-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S") echo "MID_RUN_NAME: ${MID_RUN_NAME}" PARTITION='video' JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") NUM_GPU=32 # NOTE: If you don't use slurm, please ref to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for training command. srun -p ${PARTITION} \ --job-name=${JOB_NAME} \ --ntasks=${NUM_GPU} \ --gres=gpu:8 \ --ntasks-per-node=8 \ --cpus-per-task=16 \ --kill-on-bad-exit=1 \ python -u llava/train/train_mem.py \ --deepspeed scripts/zero1.json \ --model_name_or_path ${LLM_VERSION} \ --version ${PROMPT_VERSION} \ --data_path ${DATA_VERSION} \ --vision_tower ${VISION_MODEL_VERSION} \ --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \ --mm_vision_tower_lr=2e-6 \ --mm_vision_select_layer -2 \ --mm_projector_type ${mm_projector_type} \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --group_by_modality_length True \ --image_aspect_ratio anyres_nopad \ --image_grid_pinpoints "(1x1),...,(6x6)" \ --mm_patch_merge_type spatial_nopad \ --mm_newline_position nothing \ --bf16 True \ --run_name $MID_RUN_NAME \ --output_dir ./checkpoints/stage3-video_sft/${MID_RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 10000 \ --save_total_limit 1 \ --learning_rate 1e-5 \ --weight_decay 0. 
\ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 32768 \ --gradient_checkpointing True \ --dataloader_num_workers 6 \ --lazy_preprocess True \ --report_to tensorboard \ --torch_compile True \ --torch_compile_backend "inductor" \ --dataloader_drop_last True \ --frames_upbound 512 \ --frames_lowbound 64 \ --time_msg short \ --local_num_frames 4 \ --vision_encode_type video_image \ --sample_type dynamic_fps1 \ --mm_local_num_frames 4 \ --verbose_logging True >> ./output_logs/stage3-video_sft/${MID_RUN_NAME}.log ================================================ FILE: llava-train_videochat/scripts/train/stage3-video_sft/stage3_umt_tome16_res224_qwen_7b.sh ================================================ export OMP_NUM_THREADS=1 export DISABLE_ADDMM_CUDA_LT=1 export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1 DATA_VERSION="data/stage3_short-long_mix_sft.yaml" DATA_VERSION_CLEAN=$(basename "$DATA_VERSION") VISION_MODEL_VERSION="umt-large" VISION_MODEL_VERSION_CLEAN="umt-large" LLM_VERSION="Your_stage2_checkpoint_path" LLM_VERSION_CLEAN="Qwen2_7B" mm_projector_type=tome16_mlp_hd64 PROMPT_VERSION="qwen_2" MID_RUN_NAME=stage3-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S") echo "MID_RUN_NAME: ${MID_RUN_NAME}" PARTITION='video' JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") NUM_GPU=32 # NOTE: If you don't use slurm, please ref to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for training command. srun -p ${PARTITION} \ --job-name=${JOB_NAME} \ --ntasks=${NUM_GPU} \ --gres=gpu:8 \ --ntasks-per-node=8 \ --cpus-per-task=16 \ --kill-on-bad-exit=1 \ python -u llava/train/train_mem.py \ --deepspeed scripts/zero1.json \ --model_name_or_path ${LLM_VERSION} \ --version ${PROMPT_VERSION} \ --data_path ${DATA_VERSION} \ --vision_tower ${VISION_MODEL_VERSION} \ --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \ --mm_vision_tower_lr=2e-6 \ --mm_vision_select_layer -2 \ --mm_projector_type ${mm_projector_type} \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --group_by_modality_length True \ --image_aspect_ratio anyres_nopad \ --image_grid_pinpoints "(1x1),...,(6x6)" \ --mm_patch_merge_type spatial_nopad \ --mm_newline_position nothing \ --bf16 True \ --run_name $MID_RUN_NAME \ --output_dir ./checkpoints/stage3-video_sft/${MID_RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 10000 \ --save_total_limit 1 \ --learning_rate 1e-5 \ --weight_decay 0.
\ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 32768 \ --gradient_checkpointing True \ --dataloader_num_workers 6 \ --lazy_preprocess True \ --report_to tensorboard \ --torch_compile True \ --torch_compile_backend "inductor" \ --dataloader_drop_last True \ --frames_upbound 512 \ --frames_lowbound 64 \ --time_msg short \ --local_num_frames 4 \ --vision_encode_type video_image \ --sample_type dynamic_fps1 \ --mm_local_num_frames 4 \ --verbose_logging True >> ./output_logs/stage3-video_sft/${MID_RUN_NAME}.log ================================================ FILE: llava-train_videochat/scripts/train/stage3-video_sft/stage3_umt_tome16_res448_qwen_1_5b.sh ================================================ export OMP_NUM_THREADS=1 export DISABLE_ADDMM_CUDA_LT=1 export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1 DATA_VERSION="data/stage3_short-long_mix_sft.yaml" DATA_VERSION_CLEAN=$(basename "$DATA_VERSION") VISION_MODEL_VERSION="umt-hd-large" VISION_MODEL_VERSION_CLEAN="umt-hd-large" LLM_VERSION="Your_stage2_checkpoint_path" LLM_VERSION_CLEAN="Qwen2_5_1_5B" mm_projector_type=tome16_mlp_hd64 PROMPT_VERSION="qwen_2" MID_RUN_NAME=stage3-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S") echo "MID_RUN_NAME: ${MID_RUN_NAME}" PARTITION='video' JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") NUM_GPU=32 # NOTE: If you don't use slurm, please ref to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for training command. srun -p ${PARTITION} \ --job-name=${JOB_NAME} \ --ntasks=${NUM_GPU} \ --gres=gpu:8 \ --ntasks-per-node=8 \ --cpus-per-task=16 \ --kill-on-bad-exit=1 \ python -u llava/train/train_mem.py \ --deepspeed scripts/zero1.json \ --model_name_or_path ${LLM_VERSION} \ --version ${PROMPT_VERSION} \ --data_path ${DATA_VERSION} \ --vision_tower ${VISION_MODEL_VERSION} \ --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \ --mm_vision_tower_lr=2e-6 \ --mm_vision_select_layer -2 \ --mm_projector_type ${mm_projector_type} \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --group_by_modality_length True \ --image_aspect_ratio anyres_nopad \ --image_grid_pinpoints "(1x1),...,(6x6)" \ --mm_patch_merge_type spatial_nopad \ --mm_newline_position nothing \ --bf16 True \ --run_name $MID_RUN_NAME \ --output_dir ./checkpoints/stage3-video_sft/${MID_RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 10000 \ --save_total_limit 1 \ --learning_rate 1e-5 \ --weight_decay 0.
\ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 32768 \ --gradient_checkpointing True \ --dataloader_num_workers 6 \ --lazy_preprocess True \ --report_to tensorboard \ --torch_compile True \ --torch_compile_backend "inductor" \ --dataloader_drop_last True \ --frames_upbound 512 \ --frames_lowbound 64 \ --time_msg short \ --local_num_frames 4 \ --vision_encode_type video_image \ --sample_type dynamic_fps1 \ --mm_local_num_frames 4 \ --verbose_logging True >> ./output_logs/stage3-video_sft/${MID_RUN_NAME}.log ================================================ FILE: llava-train_videochat/scripts/train/stage4_highres_postft/stage4_umt_tome16_res448_qwen_7b.sh ================================================ export OMP_NUM_THREADS=1 export DISABLE_ADDMM_CUDA_LT=1 export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1 DATA_VERSION="data/stage4_highres_postsft" DATA_VERSION_CLEAN=$(basename "$DATA_VERSION") VISION_MODEL_VERSION="umt-hd-large" VISION_MODEL_VERSION_CLEAN="umt-hd-large" # NOTE Please modify vision_tower="umt-hd-large" in Your_stage3_checkpoint_path/config.json first! LLM_VERSION="Your_stage3_checkpoint_path" LLM_VERSION_CLEAN=$(basename "$LLM_VERSION") mm_projector_type=tome16_mlp_hd64 PROMPT_VERSION="qwen_2" MID_RUN_NAME=stage4-${VISION_MODEL_VERSION_CLEAN}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S") echo "MID_RUN_NAME: ${MID_RUN_NAME}" PARTITION='video' JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") NUM_GPU=32 # NOTE: If you don't use slurm, please ref to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for training command. srun -p ${PARTITION} \ --job-name=${JOB_NAME} \ --ntasks=${NUM_GPU} \ --gres=gpu:8 \ --ntasks-per-node=8 \ --cpus-per-task=16 \ --kill-on-bad-exit=1 \ python -u llava/train/train_mem.py \ --deepspeed scripts/zero1.json \ --model_name_or_path ${LLM_VERSION} \ --version ${PROMPT_VERSION} \ --data_path ${DATA_VERSION} \ --vision_tower ${VISION_MODEL_VERSION} \ --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter" \ --mm_vision_tower_lr=2e-6 \ --mm_vision_select_layer -2 \ --mm_projector_type ${mm_projector_type} \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --group_by_modality_length True \ --image_aspect_ratio anyres_nopad \ --image_grid_pinpoints "(1x1),...,(6x6)" \ --mm_patch_merge_type spatial_nopad \ --mm_newline_position nothing \ --bf16 True \ --run_name $MID_RUN_NAME \ --output_dir ./checkpoints/stage4-highres_postsft/${MID_RUN_NAME} \ --num_train_epochs 1 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 8000 \ --save_total_limit 1 \ --learning_rate 1e-5 \ --weight_decay 0.
\ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 32768 \ --gradient_checkpointing True \ --dataloader_num_workers 4 \ --lazy_preprocess True \ --report_to tensorboard \ --torch_compile True \ --torch_compile_backend "inductor" \ --dataloader_drop_last True \ --frames_upbound 512 \ --frames_lowbound 64 \ --time_msg short \ --local_num_frames 4 \ --vision_encode_type video_image \ --sample_type dynamic_fps1 \ --mm_local_num_frames 4 \ --verbose_logging True >> ./output_logs/stage4-highres_postsft/${MID_RUN_NAME}.log ================================================ FILE: llava-train_videochat/scripts/zero1.json ================================================ { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 1, "reduce_bucket_size": 500000000.0 }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 100, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } ================================================ FILE: llava-train_videochat/scripts/zero2.json ================================================ { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "none", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": false, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 100, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } ================================================ FILE: llava-train_videochat/scripts/zero2_fused_adamw.json ================================================ { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "none", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 100, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } ================================================ FILE: llava-train_videochat/scripts/zero2_offload.json ================================================ { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "train_micro_batch_size_per_gpu": "auto", "train_batch_size": "auto",
"gradient_accumulation_steps": "auto", "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto" } } ================================================ FILE: llava-train_videochat/scripts/zero3.json ================================================ { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "none", "pin_memory": true }, "offload_param": { "device": "none", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 100, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } ================================================ FILE: llava-train_videochat/scripts/zero3_offload.json ================================================ { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "steps_per_print": 1e5, "wall_clock_breakdown": false } ================================================ FILE: llava-train_videochat/scripts/zero3pp.json ================================================ { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "none", "pin_memory": true }, "offload_param": { "device": "none", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "zero_quantized_weights": true, "zero_hpz_partition_size": 16, "zero_quantized_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 100, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", 
"wall_clock_breakdown": false } ================================================ FILE: lmms-eval_videochat/.gitignore ================================================ env *.pyc output/ data/ lm_cache .idea build dist *.egg-info venv .vscode/ temp __pycache__ .ipynb_checkpoints temp .DS_STORE # IPython profile_default/ ipython_config.py logs/ wandb/ SimSun.ttf submissions/ lmms_eval/tasks/hallusion_bench/hallusion_output_vs_model.json lmms_eval/tasks/hallusion_bench/hallusion_output_vd_model.json zk.log cache_dir ckpt pretrained/ LLaVA/ *logs temp/ logs/ data/ ================================================ FILE: lmms-eval_videochat/.pre-commit-config.yaml ================================================ repos: - repo: https://github.com/psf/black rev: 23.12.1 hooks: - id: black language_version: python3 ================================================ FILE: lmms-eval_videochat/LICENSE ================================================ # For the main pipeline structure-related code, we maintain the original license provided with lm-evaluation-harness, which is the MIT License. MIT License Copyright (c) 2024 LMMs-Lab Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. # For the multimodal models and datasets that we have added (defined as code in the lmms_eval/tasks and lmms_eval/models folders), we apply the Apache License. Apache 2.0 License Copyright (c) 2024 LMMs-Lab Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. When modifying the code, please include the following information about the original lmms-eval source: # Adopted from lmms-eval from https://github.com/EvolvingLMMs-Lab/lmms-eval. Below is the original copyright: # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
# See the License for the specific language governing permissions and # limitations under the License. ================================================ FILE: lmms-eval_videochat/README.md ================================================ # How to use We have modified the data loading method for lmms-eval: instead of loading from Hugging Face, the data is loaded locally. Therefore, when using it, you need to **specify the data path** in the YAML file of each task. The data can be downloaded from [lmms-eval](https://huggingface.co/lmms-lab) or the official repos of the corresponding tasks. ## Installation You can install the package by cloning the repository and running the following command: ```bash git clone https://github.com/OpenGVLab/VideoChat-Flash cd lmms-eval_videochat pip install -e . ``` We provide all evaluation [scripts](scripts) and [annotations](eval_annotations) here. You could evaluate one task: ```bash TASK=mvbench MODEL_NAME=videochat_flash MAX_NUM_FRAMES=512 CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") MASTER_PORT=$((18000 + $RANDOM % 100)) NUM_GPUS=8 accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \ --model ${MODEL_NAME} \ --model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES} ``` You could evaluate more tasks at once like: ```bash TASK=videomme,videomme_w_subtitle MODEL_NAME=videochat_flash MAX_NUM_FRAMES=512 CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S") MASTER_PORT=$((18000 + $RANDOM % 100)) NUM_GPUS=8 accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \ --model ${MODEL_NAME} \ --model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES} ``` We provide our [evaluation log](https://github.com/OpenGVLab/VideoChat-Flash/blob/main/lmms-eval_videochat/videochat-flash-7B%40448_eval_log_videomme.json) of videomme for reproducibility. ================================================ FILE: lmms-eval_videochat/docs/README.md ================================================ # LMMs Eval Documentation Welcome to the docs for `lmms-eval`! The majority of this documentation is adapted from [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/) ## Table of Contents * To learn about the command line flags, see the [commands](commands.md) * To learn how to add a new model, see the [Model Guide](model_guide.md). * For a crash course on adding new tasks to the library, see our [Task Guide](task_guide.md). * If you need to upload your datasets into the correct HF format with viewer support, please refer to [tools](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/pufanyi/hf_dataset_docs/tools) ================================================ FILE: lmms-eval_videochat/docs/commands.md ================================================ # User Guide This document details the interface exposed by `lmms_eval` and provides details on what flags are available to users.
## Command-line Interface Running the library can be done via the `lmms_eval` entrypoint at the command line. This mode supports a number of command-line arguments, the details of which can also be seen by running with `-h` or `--help`: * `--model` : Selects which model type or provider is evaluated. Must be a model registered under lmms_eval/models. For example, `--model qwen_vl` or `--model llava`. * `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, such as, for example `--model_args pretrained=liuhaotian/llava-v1.5-7b,batch_size=1`. For a full list of supported keyword arguments, see the initialization of the corresponding model class in `lmms_eval/models/`. * `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must consist solely of valid tasks/groups. You can use `--tasks list` to see all the available tasks. If you add your own tasks but they are not shown in the list, you can try to set `--verbosity=DEBUG` to view the error message. You can also use `--tasks list_with_num` to check every task and the number of questions it contains. However, `list_with_num` will download all the available datasets and may require lots of memory and time. * `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lmms_eval` sorts documents in descending order of context length. * `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well. * `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`. * `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models. ## Usage with SRT API > install sglang ```bash git clone https://github.com/sgl-project/sglang.git # Current version is tested on #1222 cd sglang; pip install -e "python[srt]" # Install FlashInfer CUDA kernels pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ ``` > run sglang backend service with the following command ```bash # After update, there is no need to use an extra command to set up the backend server # the server will be initialized in the init process # launch lmms-eval srt_api model CKPT_PATH=$1 TASK=$2 MODALITY=$3 TP_SIZE=$4 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX python3 -m lmms_eval \ --model srt_api \ --model_args modality=$MODALITY,model_version=$CKPT_PATH,tp=$TP_SIZE,host=127.0.0.1,port=30000,timeout=600 \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` You may need to install some dependencies for the above command to work (if you encounter some errors).
```bash pip install httpx==0.23.3 pip install protobuf==3.20 ``` ================================================ FILE: lmms-eval_videochat/docs/current_tasks.md ================================================ # Current Tasks > () indicates the task name in the lmms_eval. The task name is also used to specify the dataset in the configuration file. > The following is manually updated documentation. You could use `lmms_eval task --list` to list all supported tasks and their task names. - AI2D (ai2d) - ChartQA (chartqa) - CMMMU (cmmmu) - CMMMU Validation (cmmmu_val) - CMMMU Test (cmmmu_test) - COCO Caption (coco_cap) - COCO 2014 Caption (coco2014_cap) - COCO 2014 Caption Validation (coco2014_cap_val) - COCO 2014 Caption Test (coco2014_cap_test) - COCO 2017 Caption (coco2017_cap) - COCO 2017 Caption MiniVal (coco2017_cap_val) - COCO 2017 Caption MiniTest (coco2017_cap_test) - [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench) - DOCVQA (docvqa) - DOCVQA Validation (docvqa_val) - DOCVQA Test (docvqa_test) - Ferret (ferret) - Flickr30K (flickr30k) - Ferret Test (ferret_test) - GQA (gqa) - HallusionBenchmark (hallusion_bench_image) - Infographic VQA (info_vqa) - Infographic VQA Validation (info_vqa_val) - Infographic VQA Test (info_vqa_test) - LLaVA-Bench (llava_in_the_wild) - LLaVA-Bench-COCO (llava_bench_coco) - MathVerse (mathverse) - MathVerse Text Dominant (mathverse_testmini_text_dominant) - MathVerse Text Only (mathverse_testmini_text_only) - MathVerse Text Lite (mathverse_testmini_text_lite) - MathVerse Vision Dominant (mathverse_testmini_vision_dominant) - MathVerse Vision Intensive (mathverse_testmini_vision_intensive) - MathVerse Vision Only (mathverse_testmini_vision_only) - MathVista (mathvista) - MathVista Validation (mathvista_testmini) - MathVista Test (mathvista_test) - MMBench (mmbench) - MMBench English (mmbench_en) - MMBench English Dev (mmbench_en_dev) - MMBench English Test (mmbench_en_test) - MMBench Chinese (mmbench_cn) - MMBench Chinese Dev (mmbench_cn_dev) - MMBench Chinese Test (mmbench_cn_test) - MME (mme) - MMMU (mmmu) - MMMU Validation (mmmu_val) - MMMU Test (mmmu_test) - MMStar (mmstar) - MMUPD (mmupd) - MMUPD Base (mmupd_base) - MMAAD Base (mmaad_base) - MMIASD Base (mmiasd_base) - MMIVQD Base (mmivqd_base) - MMUPD Option (mmupd_option) - MMAAD Option (mmaad_option) - MMIASD Option (mmiasd_option) - MMIVQD Option (mmivqd_option) - MMUPD Instruction (mmupd_instruction) - MMAAD Instruction (mmaad_instruction) - MMIASD Instruction (mmiasd_instruction) - MMIVQD Instruction (mmivqd_instruction) - MMVet (mmvet) - Multi-DocVQA (multidocvqa) - Multi-DocVQA Validation (multidocvqa_val) - Multi-DocVQA Test (multidocvqa_test) - NoCaps (nocaps) - NoCaps Validation (nocaps_val) - NoCaps Test (nocaps_test) - OKVQA (ok_vqa) - OKVQA Validation 2014 (ok_vqa_val2014) - POPE (pope) - RefCOCO (refcoco) - refcoco_seg_test - refcoco_seg_val - refcoco_seg_testA - refcoco_seg_testB - refcoco_bbox_test - refcoco_bbox_val - refcoco_bbox_testA - refcoco_bbox_testB - RefCOCO+ (refcoco+) - refcoco+_seg - refcoco+_seg_val - refcoco+_seg_testA - refcoco+_seg_testB - refcoco+_bbox - refcoco+_bbox_val - refcoco+_bbox_testA - refcoco+_bbox_testB - RefCOCOg (refcocog) - refcocog_seg_test - refcocog_seg_val - refcocog_bbox_test - refcocog_bbox_val - ScienceQA (scienceqa_full) - ScienceQA Full (scienceqa) - ScienceQA IMG (scienceqa_img) - ScreenSpot (screenspot) - ScreenSpot REC / Grounding (screenspot_rec) - ScreenSpot REG / Instruction Generation (screenspot_reg) - 
SeedBench (seedbench) - SeedBench 2 (seedbench_2) - SeedBench 2 Plus (seedbench_2_plus) - ST-VQA (stvqa) - TextCaps (textcaps) - TextCaps Validation (textcaps_val) - TextCaps Test (textcaps_test) - TextVQA (textvqa) - TextVQA Validation (textvqa_val) - TextVQA Test (textvqa_test) - VizWizVQA (vizwiz_vqa) - VizWizVQA Validation (vizwiz_vqa_val) - VizWizVQA Test (vizwiz_vqa_test) - VQAv2 (vqav2) - VQAv2 Validation (vqav2_val) - VQAv2 Test (vqav2_test) - WebSRC (websrc) - WebSRC Validation (websrc_val) - WebSRC Test (websrc_test) ================================================ FILE: lmms-eval_videochat/docs/model_guide.md ================================================ # New Model Guide In order to properly evaluate a given LM, we require implementation of a wrapper class subclassing the `lmms_eval.api.model.lmms` class, which defines how lmms_eval should interface with your model. This guide walks through how to write this `lmms` subclass by adding it to the library! ## Setup To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your model, and install the project requirements in your environment: ```sh # After forking... git clone https://github.com/<your-username>/lmms-eval.git cd lmms-eval git checkout -b <model-type> pip install -e . ``` Now, we'll create a new file where we'll be adding our model: ```sh touch lmms_eval/models/<my_model_filename>.py ``` **As a rule of thumb, we recommend using `lmms_eval/models/qwen_vl.py` and `lmms_eval/models/instructblip.py` as reference implementations for your model. You can copy and paste the contents of one of these files into your new file to get started.** ## Interface All models must subclass the `lmms_eval.api.model.lmms` class. The lmms class enforces a common interface via which we can extract responses from a model: ```python class MyCustomLM(lmms): #... def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]: #... def generate_until(self, requests: list[Instance]) -> list[str]: #... #... ``` Where `Instance` is a dataclass defined in [`lmms_eval.api.instance`](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/lmms_eval/api/instance.py) with property `args` of request-dependent type signature described below. We support three types of requests, consisting of different interactions / measurements with an autoregressive LM. All three request types take as input `requests` of type `list[Instance]` that have a matching `Instance.request_type` to the method name. Overall, you can check the [construct_requests](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/api/task.py#L918) to see how the arguments are constructed for the different output types. - `generate_until` - Each request contains `Instance.args : Tuple[str, dict]` containing 1. an input string to the LM and 2. a dictionary of keyword arguments used to control generation parameters. - In each `Instance.args` there will be 6 elements which are `contexts, all_gen_kwargs, doc_to_visual, doc_id, task, split`. `contexts` refers to the formatted question and is the text input for the LMM. Sometimes it might contain image tokens, which may need to be handled differently for different models. `all_gen_kwargs` refers to the dict that contains all the generation configuration for the model. We use `doc_id`, `task`, and `split` to access the dataset, and then you can use `doc_to_visual`, a function reference, to process the image.
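To make the unpacking of these six elements concrete, below is a minimal, hypothetical sketch of a `generate_until` implementation. The model call itself is stubbed out, and the `self.task_dict[task][split][doc_id]` access follows the pattern used in the reference implementations mentioned above; treat those files, not this sketch, as authoritative:

```python
from lmms_eval.api.instance import Instance
from lmms_eval.api.model import lmms
from lmms_eval.api.registry import register_model


@register_model("my_custom_lmm")  # hypothetical name, not a model shipped with lmms_eval
class MyCustomLM(lmms):
    def generate_until(self, requests: list[Instance]) -> list[str]:
        responses = []
        for request in requests:
            # The six elements described above, in order.
            contexts, all_gen_kwargs, doc_to_visual, doc_id, task, split = request.args
            # Recover the visual inputs for this document from the task's dataset.
            visuals = doc_to_visual(self.task_dict[task][split][doc_id])
            # ... preprocess `contexts` and `visuals`, then call your model's
            # generation routine with the parameters in `all_gen_kwargs` ...
            responses.append("")  # replace with the generated text
        return responses

    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        # Should return, per request, the log prob of the target given the
        # context and whether greedy decoding would reproduce the target.
        return [(0.0, False) for _ in requests]
```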
When you implement your own model, you should use these to write your own `generate_until` function. - Using this input and these generation parameters, text will be sampled from the language model (typically until a maximum output length or specific stopping string sequences--for example, `{"until": ["\n\n", "."], "max_gen_toks": 128}`). - The generated input+output text from the model will then be returned. - `loglikelihood` - Each request contains `Instance.args : Tuple[str, str]` containing 1. an input string to the LM and 2. a target string on which the loglikelihood of the LM producing this target, conditioned on the input, will be returned. - In each `Instance.args` there will be 6 elements which are `contexts, doc_to_target, doc_to_visual, doc_id, task, split`. `contexts` refers to the formatted question and is the text input for the LMM. Sometimes it might contain image tokens, which may need to be handled differently for different models. `doc_to_target` is a function reference that gets the answer from the doc. This will be the continuation of the answer, and only tokens belonging to this part should be counted for the loglikelihood. - Each request will have, as result, `(ll, is_greedy): Tuple[float, int]` returned, where `ll` is a floating point number representing the log probability of generating the target string conditioned on the input, and `is_greedy` being either the value `0` or `1`, with it being `1` if and only if the target string *would be generated by greedy sampling from the LM* (that is, if the target string is the *most likely* N-token string to be output by the LM given the input). ## Registration Congrats on implementing your model! Now it's time to test it out. To make your model usable via the command line interface to `lmms_eval`, you'll need to tell `lmms_eval` what your model's name is. This is done via a *decorator*, `lmms_eval.api.registry.register_model`. Using `register_model()`, one can both tell the package what the model's name(s) to be used are when invoking it with `python -m lmms_eval --model <model_name>` and alert `lmms_eval` to the model's existence. ```python from lmms_eval.api.registry import register_model @register_model("<name1>", "<name2>") class MyCustomLM(lmms): ``` The final step is to import your model in `lmms_eval/models/__init__.py`: ```python from .my_model_filename import MyCustomLM ``` ================================================ FILE: lmms-eval_videochat/docs/run_examples.md ================================================ # User Guide This document details the running examples for different models in `lmms_eval`. We include commands on how to prepare environments for different models and some commands to run them. ## Environmental Variables Before running experiments and evaluations, we recommend you export the following environment variables to your environment. Some are necessary for certain tasks to run. ```bash export OPENAI_API_KEY="" export HF_HOME="" export HF_TOKEN="" export HF_HUB_ENABLE_HF_TRANSFER="1" export REKA_API_KEY="" # Other possible environment variables include # ANTHROPIC_API_KEY,DASHSCOPE_API_KEY etc. ``` ## Some common environment issues Sometimes you might encounter common issues, for example errors related to `httpx` or `protobuf`.
To solve these issues, you can first try: ```bash python3 -m pip install httpx==0.23.3; python3 -m pip install protobuf==3.20; # If you are using numpy==2.x, it may sometimes cause errors python3 -m pip install numpy==1.26; # Sometimes sentencepiece is required for the tokenizer to work python3 -m pip install sentencepiece; ``` # Image Model ### LLaVA First, you will need to clone the repo of `lmms_eval` and the repo of [`llava`](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/inference) ```bash cd /path/to/lmms-eval python3 -m pip install -e .; cd /path/to/LLaVA-NeXT; python3 -m pip install -e ".[train]"; TASK=$1 CKPT_PATH=$2 CONV_TEMPLATE=$3 MODEL_NAME=$4 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX #mmbench_en_dev,mathvista_testmini,llava_in_the_wild,mmvet accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model llava \ --model_args pretrained=$CKPT_PATH,conv_template=$CONV_TEMPLATE,model_name=$MODEL_NAME \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` If you are trying to use large LLaVA models such as LLaVA-NeXT-Qwen1.5-72B, you can try adding `device_map=auto` in model_args and changing `num_processes` to 1. ### IDEFICS2 You won't need to clone any other repos to run idefics. Making sure your transformers version supports idefics2 is enough ```bash cd /path/to/lmms-eval python3 -m pip install -e .; python3 -m pip install transformers --upgrade; TASK=$1 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model idefics2 \ --model_args pretrained=HuggingFaceM4/idefics2-8b \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` ### InternVL2 ```bash cd /path/to/lmms-eval python3 -m pip install -e .; python3 -m pip install flash-attn --no-build-isolation; python3 -m pip install torchvision einops timm sentencepiece; TASK=$1 CKPT_PATH=$2 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 12380 -m lmms_eval \ --model internvl2 \ --model_args pretrained=$CKPT_PATH \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` ### InternVL-1.5 First, you need to fork [`InternVL`](https://github.com/OpenGVLab/InternVL) ```bash cd /path/to/lmms-eval python3 -m pip install -e .; cd /path/to/InternVL/internvl_chat python3 -m pip install -e .; python3 -m pip install flash-attn==2.3.6 --no-build-isolation; TASK=$1 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model internvl \ --model_args pretrained="OpenGVLab/InternVL-Chat-V1-5"\ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` ### Xcomposer-4KHD and Xcomposer-2d5 Neither of these models requires an external repo ```bash cd /path/to/lmms-eval python3 -m pip install -e .; python3 -m pip install flash-attn --no-build-isolation; python3 -m pip install torchvision einops timm sentencepiece; TASK=$1 MODALITY=$2 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX # For Xcomposer2d5 accelerate launch --num_processes 8 --main_process_port 10000 -m lmms_eval \ --model xcomposer2d5 \ --model_args pretrained="internlm/internlm-xcomposer2d5-7b",device="cuda",modality=$MODALITY\ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/ # For Xcomposer-4kHD accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model xcomposer2_4khd \ --model_args pretrained="internlm/internlm-xcomposer2-4khd-7b" \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` ### InstructBLIP ```bash cd /path/to/lmms-eval python3 -m pip install -e .; python3 -m pip install transformers --upgrade; CKPT_PATH=$1 TASK=$2 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model instructblip \ --model_args pretrained=$CKPT_PATH \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix instructblip \ --output_path ./logs/ ``` ### SRT API MODEL To enable faster testing for larger llava models, you can use this srt api model to run testing through sglang. You will need to first clone sglang from "https://github.com/sgl-project/sglang". The current version is tested on commit #1222 of sglang. Here is the script if you want to run the whole test in one script. ```bash cd /path/to/lmms-eval python3 -m pip install -e .; cd /path/to/sglang; python3 -m pip install -e "python[all]"; python3 -m pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/ CKPT_PATH=$1 TASK=$2 MODALITY=$3 TP_SIZE=$4 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX python3 -m lmms_eval \ --model srt_api \ --model_args modality=$MODALITY,model_version=$CKPT_PATH,tp=$TP_SIZE,host=127.0.0.1,port=30000,timeout=600 \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` You can use the script in `sglang` under the `test` folder to kill all sglang services # API Model ### GPT ```bash cd /path/to/lmms-eval python3 -m pip install -e .; export OPENAI_API_KEY="" TASK=$1 MODEL_VERSION=$2 MODALITIES=$3 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 30000 -m lmms_eval \ --model gpt4v \ --model_args model_version=$MODEL_VERSION,modality=$MODALITIES\ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` ### Claude ```bash cd /path/to/lmms-eval python3 -m pip install -e .; export ANTHROPIC_API_KEY="" TASK=$1 MODEL_VERSION=$2 MODALITIES=$3 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model claude \ --model_args model_version=$MODEL_VERSION\ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` # Video Model ### LLaVA-VID ```bash cd /path/to/lmms-eval python3 -m pip install -e .; cd /path/to/LLaVA-NeXT; python3 -m pip install -e ".[train]"; python3 -m pip install flash-attn --no-build-isolation; python3 -m pip install av; TASK=$1 CKPT_PATH=$2 CONV_TEMPLATE=$3 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model llavavid \ --model_args pretrained=$CKPT_PATH,conv_template=$CONV_TEMPLATE,video_decode_backend=decord,max_frames_num=32 \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` ### LLaMA-VID ```bash cd /path/to/lmms-eval python3 -m pip install -e .; # Notice that you should not leave the folder of LLaMA-VID when calling lmms-eval # Because they left their processor's config inside the repo cd /path/to/LLaMA-VID; python3
-m pip install -e . python3 -m pip install av sentencepiece; TASK=$1 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model llama_vid \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` ### Video-LLaVA ```bash cd /path/to/lmms-eval python3 -m pip install -e .; python3 -m pip install transformers --upgrade; python3 -m pip install av sentencepiece; TASK=$1 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model video_llava \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` ### MPlug-Owl Notice that this model takes a long time to load, so please be patient :) ```bash cd /path/to/lmms-eval python3 -m pip install -e .; # It has to use an old transformers version to run python3 -m pip install av sentencepiece protobuf==3.20 transformers==4.28.1 einops; TASK=$1 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model mplug_owl_video \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` ### Video-ChatGPT ```bash cd /path/to/lmms-eval python3 -m pip install -e .; python3 -m pip install sentencepiece av; TASK=$1 echo $TASK TASK_SUFFIX="${TASK//,/_}" echo $TASK_SUFFIX accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \ --model video_chatgpt \ --tasks $TASK \ --batch_size 1 \ --log_samples \ --log_samples_suffix $TASK_SUFFIX \ --output_path ./logs/ ``` ================================================ FILE: lmms-eval_videochat/docs/task_guide.md ================================================ # Task Configuration `lmms_eval` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format. These YAML configuration files, along with the current codebase commit hash, are intended to be shareable such that providing the YAML config enables another researcher to precisely replicate the evaluation setup used by another, in the case that the prompt or setup differs from standard `lmms_eval` task implementations. While adding a standard evaluation task on a new dataset can occasionally be as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups also exist. Here we'll provide a crash course on the more advanced logic implementable in YAML form available to users. ## Good Reference Tasks Contributing a new task can be daunting! Luckily, much of the work has often been done for you in a different, similarly evaluated task.
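Before studying the full examples below, it may help to see the shape of the Python hooks those YAML files reference via `!function utils.*`. The following is a hedged sketch with hypothetical names, not code from a real task; the contract it illustrates is the one noted in the examples below, namely that the return value of `process_results` is consumed by metrics, with each returned key usable as a metric name:

```python
# Hypothetical utils.py for a task named "my_task"; all names are illustrative.

def my_task_process_results(doc, results):
    # `doc` is one dataset row; `results` holds the model's raw output(s).
    prediction = results[0].strip()
    score = 1.0 if prediction.lower() == str(doc["answer"]).lower() else 0.0
    # Each key returned here can be referenced as a metric name in metric_list.
    return {"my_task_accuracy": score}


def my_task_aggregate_results(results):
    # Receives the list of per-document values recorded under one metric key
    # and reduces them to a single benchmark score.
    return sum(results) / max(len(results), 1)
```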
Good examples of task implementations to study include: Generation-based tasks: - MME (`lmms_eval/tasks/mme/mme.yaml`) ```yaml dataset_path: lmms-lab/MME dataset_kwargs: token: True task: "mme" test_split: test output_type: generate_until doc_to_visual: !function utils.mme_doc_to_visual doc_to_text: !function utils.mme_doc_to_text doc_to_target: "answer" generation_kwargs: max_new_tokens: 16 temperature: 0 top_p: 1.0 num_beams: 1 do_sample: false # The return value of process_results will be used by metrics process_results: !function utils.mme_process_results # Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results metric_list: - metric: mme_percetion_score aggregation: !function utils.mme_aggregate_results higher_is_better: true - metric: mme_cognition_score aggregation: !function utils.mme_aggregate_results higher_is_better: true lmms_eval_specific_kwargs: default: pre_prompt: "" post_prompt: "\nAnswer the question using a single word or phrase." qwen_vl: pre_prompt: "" post_prompt: " Answer:" metadata: - version: 0.0 ``` You can pay special attention to the `process_results` and `metric_list` fields, which are used to define how the model output is post-processed and scored. Also, the `lmms_eval_specific_kwargs` field is used to define model-specific prompt configurations. The default is set to follow Llava. PPL-based tasks: - Seedbench (`lmms_eval/tasks/seedbench/seedbench_ppl.yaml`) ```yaml dataset_path: lmms-lab/SEED-Bench dataset_kwargs: token: True task: "seedbench_ppl" test_split: test output_type: multiple_choice doc_to_visual: !function utils.seed_doc_to_visual doc_to_text: !function utils.seed_doc_to_text_mc doc_to_choice: !function utils.seed_doc_to_choice doc_to_target: !function utils.seed_doc_to_mc_target # Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results metric_list: - metric: acc metadata: - version: 0.0 ``` ## Configurations Tasks are configured via the `TaskConfig` object. Below, we describe all fields usable within the object, and their role in defining a task. ### Parameters Task naming + registration: - **task** (`str`, defaults to None) — name of the task. - **group** (`str`, *optional*) — name of the task group(s) a task belongs to. Enables one to run all tasks with a specified tag or group name at once. Dataset configuration options: - **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub. - **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a "config" or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.) - **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv. - **training_split** (`str`, *optional*) — Split in the dataset to use as the training split. - **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split. - **test_split** (`str`, *optional*) — Split in the dataset to use as the test split. - **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. It is asserted that this is not None if num_fewshot > 0.
**This function is not well tested so far** - **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before being fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to that expected by a prompt template. Prompting / in-context formatting options: - **doc_to_text** (`Union[Callable, str]`, *optional*) — Column name or function to process a sample into the appropriate input for the model. - **doc_to_visual** (`Union[Callable, str]`, *optional*) — Function to process a sample into the appropriate input images for the model. - **doc_to_target** (`Union[Callable, str]`, *optional*) — Column name or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into the list of choices. - **doc_to_choice** (`Union[Callable, str]`, *optional*) — Column name or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `generate_until` tasks. Runtime configuration options: - **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input. **This function is not well tested so far** - **batch_size** (`int`, *optional*, defaults to 1) — Batch size. **So far some models (such as qwen) may not support batch size > 1. Some models (such as llava) will generate different scores for different batch sizes. We recommend setting batch size to 1 for final benchmarking runs.** Scoring details: - **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. - **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, and `multiple_choice`. - **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes. ================================================ FILE: lmms-eval_videochat/eval_annotations/LVBench/README.md ================================================ --- license: mit extra_gated_prompt: >- You agree to not use the dataset to conduct experiments that cause harm to human subjects. Please note that the data in this dataset may be subject to other agreements. Before using the data, be sure to read the relevant agreements carefully to ensure compliant use. Video copyrights belong to the original video creators or platforms and are for academic research use only. task_categories: - visual-question-answering extra_gated_fields: Name: text Company/Organization: text Country: text E-Mail: text modalities: - Video - Text configs: - config_name: lvbench data_files: json/lvbench_clean.json - config_name: lvbench_cartoon data_files: json/lvbench_clean_cartoon.json - config_name: lvbench_documentary data_files: json/lvbench_clean_documentary.json - config_name: lvbench_live data_files: json/lvbench_clean_live.json - config_name: lvbench_selfmedia data_files: json/lvbench_clean_selfmedia.json - config_name: lvbench_sport data_files: json/lvbench_clean_sport.json - config_name: lvbench_tv data_files: json/lvbench_clean_tv.json language: - en size_categories: - 1K<n<10K